Preface



The Mexican International Conference on Artificial Intelligence (MICAI) is a
biennial conference established to promote research in artificial intelligence (AI),
and cooperation among Mexican researchers and their peers worldwide. MICAI
is organized by the Mexican Society for Artificial Intelligence (SMIA), in
collaboration with the American Association for Artificial Intelligence (AAAI) and the
Mexican Society for Computer Science (SMCC).

After two successful conferences, we are pleased to present the 3rd Mexican
International Conference on Artificial Intelligence, MICAI 2004, which took place
on April 26-30, 2004, in Mexico City, Mexico. This volume contains the papers
included in the conference main program, which was complemented by tutorials
and workshops, published in supplementary proceedings. The proceedings of
past MICAI conferences, 2000 and 2002, were also published in Springer-Verlag’s
Lecture Notes in Artificial Intelligence (LNAI) series, volumes 1793 and 2313.

The number of submissions to MICAI 2004 was significantly higher than
that of previous conferences: 254 papers from 19 different countries were
submitted for consideration. The evaluation of this unexpectedly
large number of papers was a challenge, both in terms of the quality of the papers
and of the review workload of each PC member. After a thorough reviewing
process, MICAI’s Program Committee and Program Chairs accepted 97
high-quality papers, an acceptance rate of 38.2%. CyberChair, a free Web-based
paper submission and reviewing system, was used as electronic support for
the reviewing process.

This book contains revised versions of the 94 papers presented at the
conference. The volume is structured into 13 thematic fields according to the topics
addressed by the papers, which are representative of the main current areas of
interest within the AI community.

We are proud of the quality of the research presented at MICAI 2004, and 
hope that this volume will become an important archival reference for the field. 



April 2004 Raul Monroy 

Gustavo Arroyo-Figueroa 
Luis Enrique Sucar 
Humberto Sossa

Abstract. Most modern lossless data compression techniques are based on
dictionaries. If some string of the data being compressed
matches a portion previously seen, then that string is included in the
dictionary and a reference to it is emitted every time it appears. A possible
generalization of this scheme is to consider not only strings made of
consecutive symbols, but more general patterns with gaps between their
symbols. The main problems with this approach are the complexity of
pattern discovery algorithms and the complexity of selecting a
good subset of patterns. In this paper we address the latter problem.
We show that it is NP-complete and we provide
some preliminary results on heuristics that point toward its solution.

Categories and Subject Descriptors: E.4 [Coding and Information
Theory] - data compaction and compression; F.2.2 [Analysis of
Algorithms and Problem Complexity]: Nonnumerical Problems;
I.2.8 [Artificial Intelligence]: Problem Solving, Control Methods, and
Search - heuristic methods.

General Terms: Algorithms, Theory 

Additional Keywords and Phrases: Genetic algorithms, optimization,
NP-hardness



1 Introduction 

Since the introduction of information theory in Shannon’s seminal paper [7],
the efficient representation of data has been one of its fundamental subjects. Most of
the successful modern lossless methods are dictionary-based, such
as the Lempel-Ziv family [10]. These dictionary-based methods only consider
strings of consecutive symbols. In this paper we introduce a generalization
by considering “strings” with gaps, whose symbols are not necessarily consecutive.
This generalization will be called a pattern in the rest of the paper. A similar concept
is used in [1] for approximate string matching.

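In this context a pattern is a finite, ordered sequence of symbols together
with a specification of the position of each symbol. For example, a pattern
contained in the string It is better late than never could be:

p_1 = I _ _ _ _ _ b e _ _ _ _ _ _ _ t _ _ _ _ a

where each underscore marks a gap, that is, a position of the string that does
not belong to the pattern.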

R. Monroy et al. (Eds.): MICAI 2004, LNAI 2972, pp. 1-10, 2004.
A. Kuri and J. Galaviz

We consider two possible representations of patterns:

- An ordered pair (S, O) where S is the ordered sequence of symbols in the
pattern and O is the ordered sequence of absolute positions (offsets) of the
symbols.

- A 3-tuple (S, D, b) where S is the ordered sequence of symbols in the pattern,
b is the absolute position of the first symbol in S and D is the ordered
sequence of distances between symbols (positions relative to the previous
symbol).

Therefore, the pattern used as an example above could be represented as
S = {I, b, e, t, a}, O = {0, 6, 7, 15, 20}, or as b = 0, D = {6, 1, 8, 5}. In general
the second representation is more efficient than the first one, since absolute
positions are potentially larger than relative positions.
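
The two representations can be converted into each other mechanically. The following is a minimal sketch, assuming that each distance in D is the difference between two consecutive offsets; the function names are hypothetical and are not part of the paper.

```python
def offsets_to_distances(symbols, offsets):
    """Convert the (S, O) form into the (S, D, b) form.

    Assumption: each distance is the difference between consecutive offsets.
    """
    if not offsets or len(symbols) != len(offsets):
        raise ValueError("symbols and offsets must be non-empty and equally long")
    b = offsets[0]
    distances = [cur - prev for prev, cur in zip(offsets, offsets[1:])]
    return symbols, distances, b


def distances_to_offsets(symbols, distances, b):
    """Convert the (S, D, b) form back into the (S, O) form."""
    offsets = [b]
    for d in distances:
        offsets.append(offsets[-1] + d)
    return symbols, offsets


def symbols_at(text, offsets):
    """Read the symbols of a pattern from a text, given absolute offsets."""
    return [text[o] for o in offsets]


text = "It is better late than never"
S = ["I", "b", "e", "t", "a"]
O = [0, 6, 7, 15, 20]

# The offsets really select the symbols of the example pattern.
assert symbols_at(text, O) == S

_, D, b = offsets_to_distances(S, O)
print("b =", b, "D =", D)
```

Since every distance is bounded by the largest gap of the pattern, while an offset can be as large as the sample itself, the relative form usually needs fewer bits per stored number.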

By identifying frequent patterns we can proceed to include such patterns in 
a dictionary achieving data compression. 



2 Pattern-Based Data Compression 

In order to apply the procedure described above we need to accomplish several
goals:

1. Given a sample of data S, identify frequent patterns.

2. Given the set of patterns obtained in the previous step, determine the subset
of such patterns that maximizes the compression ratio.

3. Determine the best way to represent and encode patterns, and the
best way to encode references. The compressed sample will include the
(perhaps compressed) dictionary and the compressed representation of the
original sample.

This paper focuses on the second step, but we will make some remarks
regarding the first.

The first step is to find patterns in the sample. Such patterns
must be used frequently, that is, they must appear several times in the sample.
Evidently the most frequent patterns will be individual symbols. Therefore, we
need to establish another requirement: the patterns must be as large as possible.
But there is a potential conflict between the pattern size (the number of symbols
it contains) and its frequency: larger patterns are rare, short patterns are
common. In order to resolve this conflict, we consider the total number of
symbols that are contained in all the appearances of a given pattern p. This
number will be called the coverage of the pattern and can be calculated as the
product of the pattern frequency and the pattern size. In notation:

$$ \mathrm{Cov}(p) = f(p)\, t(p) \qquad (1) $$

where f(p) denotes the frequency of pattern p and t(p) the number of symbols
in p. In our example of the previous section the pattern size is t(p_1) = 5.

It is convenient to distinguish between the nominal coverage of a pattern,
the product mentioned before, and the effective coverage of a pattern in a
set of patterns: the number of symbols contained in all the appearances of the
pattern and not contained in any previously appeared pattern.

The task of this first step must be solved by pattern discovery
algorithms, similar to those used in the analysis of DNA sequences and biomolecular
data in general. Unfortunately, the reported algorithms for the discovery of
patterns of the kind we are interested in have exponential time complexity [9].



3 Selecting Good Subset of Patterns 



The second step of the process described above is the selection of a good subset
of the whole set of frequent patterns found in the sample. The “goodness” of a
subset of patterns is given by the compression ratio obtained if that subset
is used as the dictionary.

The best subset must cover a considerably large number of the
symbols contained in the sample, so every pattern in the subset must have
good coverage. It must also have few data symbols that are covered more
than once. That is, the best subset must have efficient patterns: a large number of
covered symbols but a small number of symbols covered by other patterns or
by different appearances of the pattern itself.

Let S be a sample of data. We will denote by |S| the number of symbols
in the sample (its original size), and by T(Q) the size of the
compressed sample using the subset of patterns Q. With P we will denote the
whole set of patterns found in S, therefore Q ⊆ P. Using this notation we
define the compression ratio.

Definition 1. The compression ratio obtained by using the subset of patterns
Q is:

$$ G(Q) = 1 - \frac{T(Q)}{|S|} \qquad (2) $$

T(Q) has two components: the dictionary size D(Q) and the representation
of the sample itself E(Q). These are given by:

$$ D(Q) = z \sum_{p_i \in Q} t(p_i) + r(Q) \qquad (3) $$

$$ E(Q) = 1 + \sum_{p_i \in Q} f(p_i) \qquad (4) $$

Hence:

$$ T(Q) = D(Q) + E(Q) \qquad (5) $$

In the expression for D, z > 1, since for every pattern included in the
dictionary every symbol in the pattern must appear, together with the distances
between symbols and the offset of the first symbol in the pattern.
We will assume that every distance or offset uses z times the space (bits,
for example) used for each symbol. In what follows z will be roughly estimated
as z = 2, assuming that for every symbol in a given pattern we need to store
the symbol itself and its distance to the previous symbol in the same pattern,
and that both require the same amount of data.

In expression (3), r(Q) is the number of symbols not covered by patterns in
Q. Such symbols must appear as a pattern in the dictionary. Once a pattern
subset Q is chosen, r(Q) is determined. Hence, we can equivalently think that r
is included in the sum.

In the expression for E, the 1 that is added to the sum represents
the “pattern” made of those symbols in S not contained in any pattern
of Q, which is also stored in the dictionary. This is the pattern with the r(Q)
symbols just mentioned. E(Q) is the number of pattern identifiers used to
represent the sample in terms of references. For every appearance of a pattern
in Q, a reference must be included in the compressed sample. As stated, a
reference to the “pattern” of symbols not covered by Q must also be included.

Denoting by p_i the i-th pattern contained in a set of patterns, the coverage
of a subset of patterns Q is:

$$ \mathrm{Cov}(Q) = \sum_{p_i \in Q} f(p_i)\, t(p_i) \qquad (6) $$

where f(p_i) and t(p_i) are the frequency and size of pattern p_i respectively.
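
The quantities defined in expressions (2) to (6) can be evaluated directly once the size and frequency of every pattern are known. The sketch below is only illustrative: the `Pattern` class and the function names are hypothetical, and it assumes that the appearances of the selected patterns do not overlap, so that the number of uncovered symbols is r(Q) = |S| - Cov(Q).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pattern:
    size: int       # t(p): number of symbols in the pattern
    frequency: int  # f(p): number of appearances in the sample


def coverage(patterns):
    """Nominal coverage Cov(Q), expression (6)."""
    return sum(p.frequency * p.size for p in patterns)


def compression_ratio(patterns, sample_size, z=2):
    """Compression ratio G(Q), expression (2).

    Assumption: appearances do not overlap, so r(Q) = |S| - Cov(Q).
    """
    cov = coverage(patterns)
    if cov > sample_size:
        raise ValueError("restriction Cov(Q) <= |S| is violated")
    r = sample_size - cov                                  # uncovered symbols
    dictionary = z * sum(p.size for p in patterns) + r     # D(Q), expression (3)
    references = 1 + sum(p.frequency for p in patterns)    # E(Q), expression (4)
    total = dictionary + references                        # T(Q), expression (5)
    return 1 - total / sample_size


# Hypothetical subset Q for a sample of 100 symbols.
Q = [Pattern(size=5, frequency=10), Pattern(size=4, frequency=5)]
print(coverage(Q), compression_ratio(Q, sample_size=100))
```

With an empty subset the whole sample becomes the single residual "pattern", so T(Q) = |S| + 1 and the ratio is slightly negative, which is why negative compression ratios can appear for poor subsets.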

We can now state our problem as follows: 

Definition 2. OptimalPatternSubsetProblem. 

Given: 

- A data sample S with |S| symbols.

- A set P = {p_1, ..., p_n} of frequent patterns of S with frequencies f(p_i) = f_i
and sizes t(p_i) = t_i.

Find a subset Q ⊆ P that maximizes G(Q) subject to the restriction:

$$ \mathrm{Cov}(Q) \le |S| \qquad (7) $$

Hence we must find a subset of P, but there are 2^{|P|} such subsets. This
is a huge search space even for small values of |P|. In fact this is an NP-complete
problem, as we now prove. Similar problems have been proved NP-complete in
[8,4]. However, in [8] the dictionary size is not considered and the patterns are
strings of consecutive symbols. In [4] the dictionary size is also ignored and the
coverage of patterns is used as the only criterion to determine the best subset.

Theorem 1. OptimalPatternSubsetProblem is NP-complete. 

Proof. First we must prove that OPSP is in NP, which means that a solution is
verifiable in polynomial time. Given some subset Q ⊆ P and the maximum compression
ratio g, we can calculate T(Q) in time linear in the size of Q; therefore we
can also calculate G(Q) in O(|Q|). That is, the solution is verifiable in polynomial time.

Next, for the reduction we choose the (0,1) Knapsack Problem.
We must prove that any given instance of that problem can be mapped, by a
polynomial-time algorithm, into an instance of OPSP. The solution to the knapsack
instance is therefore mapped in polynomial time to the solution of some OPSP
instance.

In an instance of the (0,1) knapsack problem we are given:

- A set of objects O = {o_1, o_2, ..., o_m}.

- A function v that assigns to every object o_i its value: v(o_i) > 0.

- A function w that assigns to every object o_i its weight: w(o_i) > 0.

- A positive integer C > 0 called the capacity.

The problem consists in finding some subset B ⊆ O that maximizes:

$$ \sum_{o_i \in B} v(o_i) $$

with the restriction:

$$ \sum_{o_i \in B} w(o_i) \le C $$

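Let B ⊆ O. The algorithm proceeds as follows.
For every object o_i ∈ B compute:

$$ w'(o_i) = \begin{cases} \dfrac{(v(o_i) - C_B)^2}{8} & \text{if } 8\,w(o_i) > (v(o_i) - C_B)^2 \\ w(o_i) & \text{otherwise} \end{cases} $$

This means that:

$$ 8\,w'(o_i) \le (v(o_i) - C_B)^2 \qquad (8) $$

Now we establish the values for the sample size, the frequencies, and the sizes
in the corresponding instance of OPSP. The constant C_B is the part of the
capacity assigned to each object of B, so that over the objects in B these parts
add up to C:

$$ \sum_{o_i \in B} C_B = C \qquad (9) $$

$$ f(o_i) = \frac{(C_B - v(o_i)) + \sqrt{(v(o_i) - C_B)^2 - 8\,w'(o_i)}}{2} \qquad (10) $$

$$ t(o_i) = \frac{w'(o_i)}{f(o_i)} \qquad (11) $$

where (8) guarantees that the square root in (10) is real, so that f(o_i) is a
real solution of:

$$ f^2(o_i) + f(o_i)\,(v(o_i) - C_B) + 2\,w'(o_i) = 0 $$

From this we obtain:

$$ v(o_i) = \frac{C_B\, f(o_i) - 2\,w'(o_i) - f^2(o_i)}{f(o_i)} \qquad (12) $$

Finally, using (11) in the last expression:

$$ v(o_i) = C_B - (2\,t(o_i) + f(o_i)) \qquad (13) $$

Also from (11) we have:

$$ w'(o_i) = f(o_i)\, t(o_i) \qquad (14) $$

Therefore, in the knapsack we want to maximize:

$$ \sum_{o_i \in B} v(o_i) = \sum_{o_i \in B} \left[ C_B - (2\,t(o_i) + f(o_i)) \right] = C - \sum_{o_i \in B} (2\,t(o_i) + f(o_i)) \qquad (15) $$

where the last step uses (9). If we establish |S| = C for OPSP, then maximizing
the last expression also maximizes:

$$ 1 - \frac{1}{|S|} \sum_{o_i \in B} (2\,t(o_i) + f(o_i)) $$

which is (2) considering (3) and (4) with z = 2 (excluding the terms related to the
“pattern” of symbols not included in any other pattern).

The restriction is transformed as follows:

$$ \sum_{o_i \in B} w'(o_i) = \sum_{o_i \in B} f(o_i)\, t(o_i) \le \sum_{o_i \in B} w(o_i) \le C = |S| $$

In terms of OPSP the restriction means that the joint coverage of patterns in B
cannot be greater than the total size of the original sample.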

Therefore the solution to knapsack, using the transformations above, can be 
mapped into the solution of OPSP in polynomial time. 

□ 



Since OPSP is NP-complete we need heuristics in order to obtain an approximate
solution for large samples. In what follows we address such heuristics.



4 Heuristics for Subset Selection 

4.1 Using a Genetic Algorithm 

Our first heuristic approach was the use of a genetic algorithm (GA) for the
selection of a good pattern subset. GAs have been used to solve NP-complete
problems in the past [2,3].

The first step is to define a useful domain representation. Since we need to
find a subset of a set of m different patterns, we can encode every subset using
a binary string of length m. The i-th pattern is included in a subset if the i-th
bit of its encoding string is “1”; otherwise the pattern is excluded. The number
of possible binary strings of length m is 2^m, the cardinality of our search space:
the power set.

The fitness function is given by (2), and the selection will be performed using 
a deterministic scheme called Vasconcelos Selection [5]^. Such selection scheme 
has improved the performance of GA when combined with 2-point crossover, as 
is statistically proved in [6]. 



4.2 Using a Coverage-Based Heuristic 

We propose an alternative method for the search of a good subset of patterns.
The method selects patterns according to the number of symbols in
the sample that they cover: a pattern is better the greater the number
of symbols in the sample that belong to instances of that pattern. This is our
concept of coverage. It may occur that some symbol belongs to the appearances of
several different patterns, that is, the symbol is “overcovered”. Hence we need
to measure the coverage more accurately than with the product of frequency
and size. Therefore, during the selection process, if some pattern covers symbols
already covered by a previously selected pattern, the coverage of the later pattern
should be modified to represent its effective coverage: the number of symbols that
are covered by the pattern and not already covered by some other pattern.

Given a set of patterns P we will obtain a subset B ⊆ P of selected patterns.
The heuristic algorithm for the selection of the subset is:

1. Set B = ∅.

2. Set the current coverage Cv = ∅.

3. Select the pattern p ∈ P with highest effective coverage (covered symbols
not already in Cv).

4. Add p to B.

5. Remove p from P.

6. Use the coverage of p to update the current coverage Cv.

7. Return to 3 until |Cv| equals the sample size or P = ∅.
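
The steps above can be sketched as follows. This is an illustrative reading, not the authors' implementation: each pattern is assumed to be given as the set of sample positions covered by all of its appearances, and the function name is hypothetical. The sketch also stops early when no remaining pattern adds new symbols, which is an addition to the listed steps.

```python
def select_by_coverage(pattern_positions, sample_size):
    """Greedy selection by effective coverage.

    pattern_positions: dict mapping a pattern name to the set of sample
    positions covered by all of its appearances (assumed input format).
    Returns the selected names (in selection order) and the covered positions.
    """
    remaining = dict(pattern_positions)   # P
    selected = []                         # B
    covered = set()                       # Cv
    while remaining and len(covered) < sample_size:
        # Step 3: pattern with the highest effective coverage.
        best = max(remaining, key=lambda name: len(remaining[name] - covered))
        if not remaining[best] - covered:
            break  # added assumption: nothing new can be covered, so stop
        selected.append(best)             # step 4
        covered |= remaining.pop(best)    # steps 5 and 6
    return selected, covered


# Hypothetical patterns over a sample of 8 symbols.
patterns = {"a": {0, 1, 2, 3}, "b": {2, 3, 4}, "c": {5, 6}, "d": {0, 1}}
chosen, covered = select_by_coverage(patterns, sample_size=8)
print(chosen, sorted(covered))
```

Recomputing the effective coverage of every remaining pattern at each round is what distinguishes this from sorting once by nominal coverage: pattern "b" above is ranked below "c" after "a" is selected, because most of its symbols are already covered.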

With the strategy described we obtain the subset of patterns B with
highest coverage. But good coverage does not guarantee the best compression ratio:
it may occur that patterns with large size and poor frequency, or conversely, are
included in B. Since we want the cost of including a pattern in the dictionary
to be well amortized by its use in the sample, this is not desirable.



4.3 Hillclimbing 

For the improvement of the heuristic described above, we will use hillclimbing. 
Two different hillclimbers will be defined: 



^ The best individual (I_0) is mixed with the worst one (I_{N-1}), the second best is
crossed with the second worst (I_1 and I_{N-2}, respectively), and so on.

MSHC Minimum Step HillClimber. Searches for a better binary string in the
neighborhood of radius 1 in Hamming distance. Given a binary string that
encodes a subset, we search for the best string among those that
differ from the given one in only one bit.

MRHC Minimum Replacement HillClimber. Searches for a better binary string
in the neighborhood of radius 2 in Hamming distance. Given a binary string,
we search for the best string among those obtained by exchanging the
position of a bit with value 1 with the position of a bit with value 0.

To find a better subset than the one given by the coverage-based heuristic
we run both hillclimbers: the string obtained by the heuristic is passed to MSHC,
and the output of this climber is then passed to MRHC. Every hillclimber
is executed iteratively until no further improvement is obtained.
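
A minimal sketch of the two hillclimbers follows. It assumes a subset is a list of 0/1 values and that `fitness` is any function to be maximized (in the paper this role is played by the compression ratio); the helper names `climb`, `flip_neighbors`, `swap_neighbors` and `refine` are hypothetical, and the toy fitness at the end is only there to exercise the code.

```python
def climb(bits, fitness, neighbors):
    """Move to the best improving neighbor until no neighbor improves."""
    current = list(bits)
    current_fit = fitness(current)
    while True:
        best, best_fit = None, current_fit
        for cand in neighbors(current):
            fit = fitness(cand)
            if fit > best_fit:
                best, best_fit = cand, fit
        if best is None:
            return current
        current, current_fit = best, best_fit


def flip_neighbors(bits):
    """Strings at Hamming distance 1: flip a single bit (MSHC)."""
    for i in range(len(bits)):
        cand = list(bits)
        cand[i] ^= 1
        yield cand


def swap_neighbors(bits):
    """Strings at Hamming distance 2: move a 1 onto a position holding a 0 (MRHC)."""
    ones = [i for i, b in enumerate(bits) if b == 1]
    zeros = [i for i, b in enumerate(bits) if b == 0]
    for i in ones:
        for j in zeros:
            cand = list(bits)
            cand[i], cand[j] = 0, 1
            yield cand


def mshc(bits, fitness):
    return climb(bits, fitness, flip_neighbors)


def mrhc(bits, fitness):
    return climb(bits, fitness, swap_neighbors)


def refine(bits, fitness):
    """MSHC first, then MRHC on its output, as described in the text."""
    return mrhc(mshc(bits, fitness), fitness)


# Toy fitness (an assumption for the demo): closeness to a target string.
target = [1, 0, 1, 0]
toy_fitness = lambda bits: -sum(x != y for x, y in zip(bits, target))
print(refine([0, 1, 0, 1], toy_fitness))
```

MRHC keeps the number of selected patterns constant, since it replaces one pattern by another; this lets it escape situations where adding or removing a single pattern (MSHC) cannot improve the subset.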



5 Test Cases 

In order to test the effectiveness of the heuristic methods above we define three 
simple test cases. These are only for testing, but a more robust statistical demon- 
stration of effectiveness will be developed in the future. 

The test cases are samples built with patterns defined a priori. Hence the 
expected resulting subset is the set of building patterns. The algorithm used for 
pattern discovery yields the whole set of patterns which is large in comparison 
with the subset we are looking for. The output of such algorithm is the input 
for the subset selection algorithm. 

The characteristics of each sample are shown in Table 1. The column labeled
Patterns shows the number of building patterns used for the sample. The
column labeled ComRa contains the compression ratio obtained if the building
patterns are the whole dictionary. The column Patt. found contains the number
of patterns found by the discovery algorithm; the size of the search space
therefore depends exponentially on the contents of this column.

For the genetic algorithm we use the parameter values shown in Table 2. The
GA was run for 100 generations. A repair algorithm was used in order to restrict
the maximum number of ones allowed in the chromosome (the number of patterns in
the proposed subset), and at the end of the GA execution the hillclimbers were
executed.

The results are summarized in table 3. In the table the column labeled 
ComRa is the compression ratio as defined by expression (2) before the hill- 
climbers. The column labeled HC ComRa is the compression ratio obtained 
after both hillclimbers. 

Table 3 shows that, before hillclimbing, the GA outperforms the cover-based
heuristic in samples 1 and 3. However, the solution proposed by this heuristic
can be improved more effectively than the one obtained from the GA,
since after the application of the hillclimbers the best subset is the one obtained
by the cover-based heuristic. In all cases the solution proposed by the cover-based
heuristic plus hillclimbers contains the patterns used to build the sample. The
subsets obtained by the GA contain at least 60% of such patterns.

Table 1. Characteristics of samples used.

Sample  Size  Alphabet  Patterns  ComRa    Patt. found
1       64    8         5         -0.1406  75
2       133   13        4         0.3233   309
3       185   37        7         0.3189   25

Table 2. Parameters used for GA.

Parameter              Value
Population size        100
Selection scheme       Vasconcelos
Crossover              2-point
Crossover probability  0.9
Mutation probability   0.05


Table 3. Summary of results for test cases.

        Genetic Algorithm                  C-B Heuristic
Sample  Gen  ComRa    HC ComRa  Patt       ComRa    HC ComRa  Patt
1       75   -0.2187  -0.2187   5          -0.3280  -0.1875   5
2       100  0.0225   0.3230    5          0.3082   0.3310    5
3       31   0.2756   0.2756    7          -0.0972  0.3189    7

6 Summary and Further Work 

We have defined a problem whose solution is useful for a dictionary-based 
data compression technique: the OptimalPatternSubsetProblem. We have 
proved that such problem is NP-complete. Two different heuristic techniques 
have been proposed for its approximate solution: a genetic algorithm and a 
cover-based heuristic technique. 

In order to refine the solutions proposed by these heuristics two different
hillclimbing methods were introduced: MSHC and MRHC. These hillclimbers
are executed iteratively over the solutions proposed by the heuristics
until no better proposal is found.

Our very preliminary results show that the best heuristic is the cover-based
one. This is to be expected since the GA is not modified with special-purpose
operators. However, a repair algorithm was included in every generation of the GA in
order to reduce the search space by restricting the number of bits with value 1 in
the chromosomes.

The cover-based heuristic technique does not outperform the GA by itself, 
at least not necessarily. The efficiency in the solution proposed by this heuristic 
was achieved through the use of hillclimbers. Also the cover-based heuristic is 
several times faster than the GA, since there is not an evolutionary process. 

Further experimentation using the cover-based technique is required in order
to provide statistically robust performance measurements; the results shown are
not conclusive. However, in order to provide a robust evaluation we need to
solve the first phase of the general compression procedure. Such an evaluation can also
compare the methods used with other heuristics: tabu search, simulated
annealing and memetic algorithms could be used.

Once the selection of the best subset is done, the third phase must be performed.
The goal is to find a set of patterns which form a meta-symbol dictionary.
A similar approach is shown in [4]. Therefore, well-known encoding techniques
can be used on this meta-alphabet, such as those based on information theory.
We can then imagine that the sample has been produced by an
information source whose alphabet is the set of meta-symbols rather than the
original one.

Abstract. This article presents the use of the Simulated Annealing algorithm to
solve the waste minimization problem in roll cutting programming, in this case
for paper. Client orders, which vary in weight, width, and external and internal
diameter, are fully satisfied, and no additional cuts to inventory are generated
unless they are specified. Once an optimal cutting program is obtained, the algorithm is
applied again to minimize cutting blade movements. Several tests were performed
with real data from a paper company, in which an average waste reduction of 30%
and a 100% reduction in production to inventory were obtained compared to the previous
procedure. Actual savings represent about $5,200,000 USD in four months with
4 cutting machines.



1 Introduction 

The paper industry has a great variety of products, which can be classified into four
manufacturing segments: packaging, hygienic, writing and printing, and specialties.
Every segment uses paper rolls as a basic input for its processes. These rolls are called
master rolls, and they vary in their internal diameter or center, c, external diameter or
simply diameter, D, and width, w, depending on the paper type and the process it will
undergo. Fig. 1(a) shows these characteristics.

Even though there are several paper manufacturing segments, within a single
segment there are several paper types whose basic difference is density, G.

Client orders received in a paper manufacturing company are classified by paper
type, then by their diameter and center. Each order, however, varies in width (cm) and
weight (kg). The paper manufacturing company delivers to each client a certain number
of rolls (with the diameter, center and width required) that satisfy the order.

Once the orders are classified into groups by paper type (paper density), diameter,
and center, each group is processed in a cutting machine with a fixed maximum width,
combining the order widths so as to satisfy each order weight, with the objective of
minimizing the unused width of the master roll (W in Fig. 1(b)). This
optimization problem is a combinatorial one, since we are interested in finding the best
combination of widths to cut in a master roll while satisfying all orders. All cutting
combinations are grouped in a production cutting schedule, as shown in Fig. 1(b), with a
total weight or, equivalently, a total number of rolls to manufacture for each combination
or item in the cutting schedule. In the present procedure, bigger widths are satisfied first
to obtain the cutting schedule.

Fig. 1. Master rolls with and without cuts.

These are the characteristics of the procedure performed in the paper manufacturing 
company where this research was carried out: 

- Two or more people perform the task using a spreadsheet application, e.g.
MS Excel.

- The process takes between 2.5 and 3 hours to obtain an acceptable total waste, which
in this company was less than 10% of the total production weight, for four cutting
machines using a very simple iterative approximation scheme.

- The smallest total waste achieved with this procedure is due to the production of
some cuts that go directly to inventory, i.e., certain commonly used widths are
assigned to a “virtual” client’s order and created “on the fly.”

A deeper analysis showed that the main problem lay in this
last characteristic, since the company was generating about 30% more than the total waste
as inventory production. This means that if the total waste was 9,000 ton, an additional
12,000 ton went to inventory.

The process to obtain a cutting schedule has other characteristics: 

- The cutting schedule is not optimal, since waste is measured as unused width (cm)
and not as unused weight (kg) of a cutting combination.

- The time required by the procedure limits the number of times (two or three) it can be
performed before deciding which cutting schedule is going to be produced.

- The process allows a tolerance of ±10% of the order weight (occasionally more).

- The same people that perform the task decide the order of each combination in the
cutting schedule before sending it for production.

- Cutting blade movement is not considered in the process; it is performed by the
cutting machine operator.

This research shows the development of an application that:

- automates a new procedure for generating a cutting schedule,

- uses total waste weight as the objective function,

- eliminates the need to generate cuts to inventory, and

- minimizes cutting blade movement

by using the simulated annealing algorithm [15,7,8,9,10,11,12].



2 Methodology 

The first problem to solve is to create a new way of generating the cutting schedule. For
that, and knowing that the paper manufacturing company delivers paper rolls, we first
compute the number of rolls that is equivalent to each order weight. The master roll
weight is given by the following equation:

$$ p_r = (D^2 - c^2)\,\pi\, w\, G \qquad (1) $$

where p_r is the master roll weight (in kg), D is the external diameter (in m), c is the
center (internal diameter, in m), w is the master roll width (in m), and G is the paper
density (in kg/m^3). With this, we can compute the number of rolls for an order:

$$ n_r = \frac{p_i / w_i}{p_r / w} = \frac{p_i\, w}{p_r\, w_i} \qquad (2) $$

where n_r is the number of rolls for the i-th order, p_i (in kg) is the weight of the i-th
order, w_i (in m) is the width of the i-th order, p_r (in kg) is the master roll weight, and
w (in m) is the master roll width.
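
The two formulas translate directly into code. The sketch below reads Eq. (1) as p_r = (D^2 - c^2) π w G and Eq. (2) as n_r = (p_i / w_i) / (p_r / w); the function names and the numeric values are hypothetical, and the result of Eq. (2) is left unrounded because the text does not state how fractional rolls are handled.

```python
import math


def master_roll_weight(D, c, w, G):
    """Eq. (1) as read from the text: p_r = (D^2 - c^2) * pi * w * G.

    D, c and w are in metres, G in kg/m^3; the result is in kg.
    """
    return (D ** 2 - c ** 2) * math.pi * w * G


def rolls_for_order(p_i, w_i, p_r, w):
    """Eq. (2): order weight per unit width over roll weight per unit width."""
    return (p_i / w_i) / (p_r / w)


# Hypothetical master roll and order (not taken from the paper's data).
p_r = master_roll_weight(D=1.0, c=0.1, w=2.0, G=800.0)
n_r = rolls_for_order(p_i=2035.0, w_i=0.55, p_r=p_r, w=2.0)
print(round(p_r, 1), round(n_r, 2))
```

Because both weights are divided by their widths, Eq. (2) compares weight per unit of width, which is what makes an order of any width comparable with the master roll.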



2.1 Waste Optimization 

The cutting schedule is built by selecting a random order width w_i from
the order set and the number of times v it will be repeated in a combination, verifying
that the unused part W of the master roll is greater than or equal to the width w_i v.
This means that we can include a new cut, possibly several times, in a cutting combination
if its total width is not larger than the unused part of the master roll. When the unused
part W of the master roll is smaller than the smallest width among the orders not yet
included in the combination, a cutting combination has been generated. The number of rolls
for this combination is determined by the smallest number of rolls of an order included in
that cutting combination (repeated widths are considered). With this, the number of rolls
for the orders included in the combination is updated. The process continues until every
order is satisfied. Table 1 shows an order list where R is the number of rolls
equivalent to the required weight.
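
The procedure just described can be sketched as a small random generator. This is one possible reading, not the authors' code: orders are assumed to be given as a width and a number of rolls, the function name and data layout are hypothetical, and the number of repetitions v is drawn uniformly between 1 and the largest value that both fits in the unused width and does not exceed the rolls still required.

```python
import random


def build_schedule(orders, roll_width, rng=None):
    """Generate one random cutting schedule.

    orders: dict mapping order id -> (width, rolls_required)  (assumed format).
    Returns a list of combinations, each a dict with "items" (order id ->
    repetitions v), "rolls" (rolls to manufacture) and "unused" (width W left).
    """
    rng = rng or random.Random()
    remaining = {oid: rolls for oid, (_, rolls) in orders.items() if rolls > 0}
    if any(orders[oid][0] > roll_width for oid in remaining):
        raise ValueError("an order is wider than the master roll")

    schedule = []
    while remaining:
        unused = roll_width
        items = {}
        while True:
            # Orders not yet in this combination whose width still fits.
            fitting = [oid for oid in remaining
                       if oid not in items and orders[oid][0] <= unused]
            if not fitting:
                break  # the combination is complete
            oid = rng.choice(fitting)
            width = orders[oid][0]
            # Repeat the width v times, bounded by space and rolls left.
            v = rng.randint(1, min(int(unused // width), remaining[oid]))
            items[oid] = v
            unused -= v * width
        # Rolls to manufacture: limited by the order that runs out first,
        # taking repeated widths into account.
        rolls = min(remaining[oid] // v for oid, v in items.items())
        for oid, v in items.items():
            remaining[oid] -= rolls * v
            if remaining[oid] == 0:
                del remaining[oid]
        schedule.append({"items": items, "rolls": rolls, "unused": unused})
    return schedule


# Hypothetical orders: id -> (width in cm, rolls required).
orders = {0: (55, 6), 6: (105, 6), 12: (70, 8), 28: (24, 6)}
for combo in build_schedule(orders, roll_width=200, rng=random.Random(0)):
    print(combo)
```

Each call yields a different feasible schedule, which is the kind of randomly generated state that an optimizer such as simulated annealing can then compare by total waste.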

Table 2 shows a list of cutting combinations, or cutting schedule, where W_p is the total
width for that combination, R is the number of rolls to manufacture for that combination,
ID is the order ID (see Table 1), W_i (i = 1, 2, 3) is the width of order ID, and V_i is the
number of times that width is repeated in the combination.

Once the cutting schedule is generated, the waste for each combination is computed
as follows:

$$ W_k = \frac{W\, p_r\, n_{max}}{w} \qquad (3) $$

Table 1. Cutting orders.

ID  Width (cm)  Weight (kg)  R      ID  Width (cm)  Weight (kg)  R
0   55.0        2035         6      19  83.0        1840         4
1   145.0       5365         6      20  68.5        20376        47
2   50.0        2267         8      21  79.0        3661         8
3   150.0       1125         2      22  69.0        3190         8
4   135.0       5108         6      23  79.0        3652         8
5   80.0        5386         11     24  83.0        3432         7
6   105.0       4030         6      25  91.5        13240        23
7   90.0        2842         5      26  85.0        5702         11
8   100.0       3158         5      27  81.0        15405        30
9   55.0        8137         24     28  24.0        1742         12
10  51.0        34295        105    29  100.0       8162         13
11  70.0        3225         8      30  64.0        15000        37
12  70.0        3225         8      31  181.0       5500         5
13  69.0        2084         5      32  181.0       20000        18
14  72.0        2175         5      33  201.0       17000        14
15  59.5        6015         16     34  195.6       20000        16
16  87.0        2026         4      35  200.0       150000       117
17  87.0        3172         6      36  201.0       80000        63
18  64.5        9529         24     37  181.0       40000        35

where Wk is the waste for fc-th cutting combination, W is the master roll unused part, pr 
is the master roll weight, w is the master roll width, and rimax is the maximum number 
of rolls to manufacture for that combination. The total waste Wt is the sum of each 
combination waste Wt '■ 

WT = Y^Wk (4) 

k 

which is the objective funtion for our optimization problem. 
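The waste objective can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it takes Eq. (3) to mean "unused width fraction times master roll weight times number of rolls", the function names are hypothetical, and the 1,000 kg master roll weight is an invented value. The (Wp, R) pairs are the first four rows of Table 2, and 202.5 cm is the machine width reported in Section 3.

```python
def combination_waste(unused_width, roll_weight, n_rolls, master_width):
    """Waste in kg of one combination: W * p_r * n_max / w (assumed form of Eq. 3)."""
    return unused_width * roll_weight * n_rolls / master_width


def total_waste(schedule, roll_weight, master_width):
    """Eq. (4): sum the waste over all (total_width, n_rolls) combinations."""
    return sum(
        combination_waste(master_width - total_width, roll_weight, n_rolls, master_width)
        for total_width, n_rolls in schedule
    )


# (Wp, R) pairs taken from the first four rows of Table 2.
schedule = [(199.0, 6), (195.0, 1), (200.0, 5), (200.0, 4)]
MASTER_WIDTH_CM = 202.5   # machine width reported in Section 3
ROLL_WEIGHT_KG = 1000.0   # hypothetical master roll weight

waste_kg = total_waste(schedule, ROLL_WEIGHT_KG, MASTER_WIDTH_CM)
print(f"total waste: {waste_kg:.1f} kg")
```

Because the unused width enters linearly, combinations that run many rolls with a wide unused strip dominate the objective, which is why reordering the order list can change the total noticeably.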

Optimization is performed only with order widths that are feasible to combine. If an
order width $W_i$ satisfies

$$W_i + W_{min} > W, \qquad (5)$$

where $W_{min}$ is the smallest order width, then that order cannot be combined with any
other order and is not considered for optimization, since the waste generated by such
orders is fixed with or without optimization.
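The feasibility test of Eq. (5) amounts to a simple filter over the order list. The sketch below is a hypothetical helper, not the authors' code; it assumes that W in Eq. (5) is the maximum machine width (202.5 cm in Section 3) and uses a few widths from Table 1.

```python
def split_orders(widths, max_width):
    """Separate orders that can share a master roll from those that cannot.

    widths: dict mapping order ID -> width (cm).
    An order is 'fixed' when even the smallest order does not fit beside it (Eq. 5).
    """
    smallest = min(widths.values())
    combinable, fixed = {}, {}
    for order_id, width in widths.items():
        target = fixed if width + smallest > max_width else combinable
        target[order_id] = width
    return combinable, fixed


# A few (ID, width) pairs from Table 1.
widths = {3: 150.0, 10: 51.0, 28: 24.0, 31: 181.0, 35: 200.0}
combinable, fixed = split_orders(widths, max_width=202.5)
print("optimized:", sorted(combinable), "fixed:", sorted(fixed))
```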

2.2 Cutting Blade Movements Optimization 

The initial solution for the cutting blade movements optimization is the optimized
cutting schedule resulting from the previous stage. Our new objective function is the
difference in position between the cutting blades of combination i and the cutting blades
of combination i + 1, for i = 0, ..., $n_p$ - 1, where $n_p$ is the number of combinations in
the cutting schedule.

The generation of a new solution state for this stage is similar to that of the previous
optimization stage: two cutting combinations are randomly selected, with a variable
distance between them, and their positions are exchanged; then two different order widths
are randomly selected and exchanged.

Table 2. One possible cutting schedule for orders in Table 1.

 j     Wp    R  |  ID     W1  V1  |  ID     W2  V2  |  ID     W3  V3
 0  199.0    6  |   6  105.0   1  |  12   70.0   1  |  28   24.0   1
 1  195.0    1  |  12   70.0   2  |   0   55.0   1  |   -      -   -
 2  200.0    5  |   8  100.0   1  |  29  100.0   1  |   -      -   -
 3  200.0    4  |  29  100.0   2  |   -      -   -  |   -      -   -
 4  198.0    8  |       50.0   1  |  23   79.0      |  22   69.0
 5  197.0       |       90.0      |  24   83.0      |  28   24.0
 6              |      150.0      |  28   24.0   1  |
 7              |       55.0   1  |                 |
 8              |  20             |  27   81.0   1  |  10   51.0   1
 9  170.0       |  26   85.0   2  |   -      -   -  |   -      -   -
10  197.0       |  16          1  |   9   55.0   2  |
11  196.0       |   1  145.0   1  |  10   51.0   1  |
12  199.0    8  |  15   59.5   2  |   5   80.0   1  |   -      -   -
13  160.0    1  |       80.0   2  |                 |
14  193.5    8  |       64.5   3  |                 |
15              |  21             |                 |
16              |                 |                 |       55.0   1
17  196.5   17  |  30   64.0      |  20   68.5   1  |
18              |  10   51.0   1  |                 |
19  195.0       |  26             |       55.0   2  |
20              |              1  |  10   51.0      |
21          23  |  25   91.5      |  10   51.0   2  |   -      -   -
22              |  10   51.0      |  14   72.0   2  |   -      -   -
23  195.0       |  11   70.0      |       55.0   1  |   -      -   -
24  170.0       |  17   87.0   1  |  19   83.0   1  |   -      -   -
25  185.0    2  |  24   83.0   1  |  10   51.0   2  |   -      -   -
26  153.0       |  10   51.0   3  |   -      -   -  |   -      -   -
27  192.0    1  |  14   72.0   1  |  13   69.0   1  |  10   51.0   1
28  138.0    2  |  13   69.0   2  |   -      -   -  |   -      -   -
29  174.0    1  |  17   87.0   2  |   -      -   -  |   -      -   -

The simulated annealing algorithm has been successfully used in robotics [8,9] and
scheduling [10,11,12] applications. The following section describes the algorithm
and the requirements for implementing it in our problem.

2.3 Simulated Annealing Algorithm 

Simulated annealing is basically an iterative improvement strategy augmented by a
criterion for occasionally accepting higher cost configurations [14,7]. Given a cost
function C(z) (analogous to energy) and an initial solution or state $z_0$, the iterative
improvement approach seeks to improve the current solution by randomly perturbing $z_0$.
The Metropolis algorithm [7] was used for acceptance/rejection of the new state z' at a
given temperature T, i.e.,

- randomly perturb z to obtain z', and calculate the corresponding change in cost
  $\delta C = C(z') - C(z)$

- if $\delta C < 0$, accept the state

- if $\delta C > 0$, accept the state with probability

$$P(\delta C) = \exp(-\delta C / T) \qquad (6)$$

This represents the acceptance-rejection loop, or Markov chain, of the SA algorithm. The
acceptance criterion is implemented by generating a random number $p \in [0, 1]$ and
comparing it to $P(\delta C)$; if $p < P(\delta C)$, then the new state is accepted. The outer loop
of the algorithm is referred to as the cooling schedule, and specifies the equation by which
the temperature is decreased. The algorithm terminates when the cost function remains
approximately unchanged for $n_{no}$ consecutive outer loop iterations.
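The acceptance rule of Eq. (6) can be written as a small function. This is a minimal sketch; the function name and the optional `rng` argument are illustrative choices, not part of the original implementation.

```python
import math
import random


def metropolis_accept(delta_c, temperature, rng=random):
    """Return True if a move with cost change delta_c is accepted at the given temperature."""
    if delta_c <= 0:
        # Improvements (and cost-neutral moves) are always accepted.
        return True
    # Worse states are accepted with probability exp(-delta_c / T), Eq. (6).
    return rng.random() < math.exp(-delta_c / temperature)


# Empirical check: with delta_c = T the acceptance rate should be close to exp(-1).
rng = random.Random(0)
accepted = sum(metropolis_accept(1.0, 1.0, rng) for _ in range(10_000))
print(accepted / 10_000)
```

At high temperature almost every uphill move passes the test, while near the final temperature the rule degenerates into plain iterative improvement.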

Any implementation of simulated annealing generally requires four components: 

1. Problem configuration (domain over which the solution will be sought). 

2. Neighborhood definition (which governs the nature and magnitude of allowable 
perturbations). 

3. Cost function. 

4. Cooling schedule (which controls both the rate of temperature decrement and the 
number of inner loop iterations). 



The domain for our problem is the set of cutting combinations. The objective or cost
function is described in the previous section. The neighborhood function used for this
implementation is the same as that used in [8], where two orders are randomly selected
with the distance between them cooled, i.e., decreased as the temperature decreases. Once
the two orders are selected, their positions are exchanged and another cutting schedule is
generated. For the cutting blade movements optimization, the relative distance between
two randomly selected cutting combinations is also cooled. The allowable perturbations
are reduced by the following limiting function

$$\epsilon = \epsilon_{max} \frac{\log(T - T_f)}{\log(T_0 - T_f)} \qquad (7)$$

where $\epsilon_{max}$ is an input parameter that specifies the maximum distance between two
elements in a list, and $T$, $T_0$, $T_f$ are the current, initial and final temperatures, respectively.
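A cooled swap neighborhood of this kind might look as follows. The sketch assumes the limiting function has the form eps_max * log(T - Tf) / log(T0 - Tf) with T0 - Tf > 1; rounding and clamping the result to at least 1 are added assumptions so that a swap is always possible, and both function names are hypothetical.

```python
import math
import random


def max_swap_distance(t, t0, tf, eps_max):
    """Largest allowed index distance between the two swapped elements (Eq. 7).

    Assumes t0 - tf > 1. The result is rounded and clamped to [1, eps_max].
    """
    if t - tf <= 0:
        return 1
    eps = eps_max * math.log(t - tf) / math.log(t0 - tf)
    return max(1, min(eps_max, round(eps)))


def swap_neighbor(items, max_distance, rng=random):
    """Return a copy of items with two elements at most max_distance apart exchanged."""
    i = rng.randrange(len(items))
    low = max(0, i - max_distance)
    high = min(len(items) - 1, i + max_distance)
    j = rng.choice([k for k in range(low, high + 1) if k != i])
    neighbor = list(items)
    neighbor[i], neighbor[j] = neighbor[j], neighbor[i]
    return neighbor


order_ids = list(range(10))
distance = max_swap_distance(t=50.0, t0=100.0, tf=1.0, eps_max=8)
print(distance, swap_neighbor(order_ids, distance, random.Random(1)))
```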

The cooling schedule in this implementation is the same hybrid one introduced in [8],
in which both the temperature and the inner loop criterion vary continuously through the
annealing process [2]. The outer loop behaves nominally as a constant decrement factor,

$$T_{i+1} = \alpha T_i \qquad (8)$$

where $\alpha = 0.9$ for this paper. The temperature throughout the inner loop is allowed to
vary proportionally with the current optimal value of the cost function. So, denoting the
inner loop index as j, the temperature is modified when a state is accepted, i.e.,

$$T_j = \frac{C_j}{C_{last}} T_{last} \qquad (9)$$

where $C_{last}$ and $T_{last}$ are the cost and temperature associated with the last accepted
state. Note that at high temperatures, a high percentage of states are accepted, so the
temperature can fluctuate by a substantial magnitude within the inner loop.

The following function was used to determine the number of acceptance-rejection
loop iterations:

$$N_{in} = N_{dof} \left[ 2 + 8 \left( \frac{\log(T - T_f)}{\log(T_0 - T_f)} \right) \right] \qquad (10)$$

where $N_{dof}$ is the number of degrees of freedom of the system.

The initial temperature must be chosen such that the system has sufficient energy to
visit the entire solution space. The system is sufficiently melted if a large percentage,
e.g., 80%, of state transitions are accepted. If the initial guess for the temperature yields
less than this percentage, $T_0$ can be scaled linearly and the process repeated. The
algorithm will still proceed to a reasonable solution when there is excessive energy; it is
simply less computationally efficient. Besides the stopping criterion mentioned above, which
indicates convergence to a global minimum, the algorithm is also terminated by setting
a final temperature given by

$$T_f = \alpha^{N_{out}} T_0 \qquad (11)$$

where $N_{out}$ is the number of outer loop iterations and is given as data to our problem.
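Putting the pieces together, the hybrid cooling schedule can be sketched as a compact annealing loop. This is an illustration on a toy problem, not the authors' implementation: the inner loop length is a fixed hypothetical parameter `n_inner` rather than the temperature-dependent count, the geometric decrement is applied to whatever temperature the inner loop ends with, and the cost function is assumed to be strictly positive so that the temperature rescaling stays well defined.

```python
import math
import random


def final_temperature(t0, alpha, n_out):
    """Final temperature after n_out geometric decrements: alpha**n_out * t0."""
    return (alpha ** n_out) * t0


def anneal(cost, neighbor, z0, t0, alpha=0.9, n_out=50, n_inner=100, seed=0):
    """Simulated annealing with a geometric outer loop and cost-proportional inner temperature."""
    rng = random.Random(seed)
    z, c = z0, cost(z0)
    best_z, best_c = z, c
    t = t0
    for _ in range(n_out):
        for _ in range(n_inner):
            z_new = neighbor(z, rng)
            c_new = cost(z_new)
            delta = c_new - c
            if delta <= 0 or rng.random() < math.exp(-delta / t):
                # On acceptance, rescale T by new cost / last accepted cost.
                t = (c_new / c) * t
                z, c = z_new, c_new
                if c < best_c:
                    best_z, best_c = z, c
        t = alpha * t  # constant decrement factor of the outer loop
    return best_z, best_c


# Toy problem: minimize (x - 3)^2 + 1 over the integers with +/-1 moves.
best_z, best_c = anneal(
    cost=lambda x: (x - 3) ** 2 + 1,
    neighbor=lambda x, rng: x + rng.choice((-1, 1)),
    z0=20,
    t0=100.0,
)
print(best_z, best_c, final_temperature(100.0, 0.9, 50))
```

Tying the temperature to the current cost means the search cools quickly once it reaches a low-cost region, which is the fluctuation the text describes for the inner loop.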



3 Results 

Our implementation is written in Python for Windows and runs on a Pentium II at
400 MHz with 256 MB of RAM. All tests were performed with real data given by the
paper manufacturing company in which this research was carried out.

The initial data, shown in Table 1, are for a machine with a maximum width of 2.025 m
and 3 cutting blades.

The present (manual) cutting schedule procedure generated a total waste of 17,628.5 kg
and 16,742.0 kg of production to inventory, and it took almost three hours. The system
using the simulated annealing algorithm generated a total waste of 9,614.7 kg, without the
need for production cuts to inventory, in about eight minutes. The resulting waste is about
46% less than that of the manual procedure; considering production to inventory also as
waste, it is about 72% smaller. The total production for the test is 211,317.3 kg. This means
that the total waste compared to the total production decreased from 8.34% to 4.55% without
considering production to inventory as waste. If we consider it, it went down from 16.26%
to 4.55%. Now, if the paper production cost of 1,000 kg is $400 USD, the SA system generated
total savings of $3,205 USD, and considering inventory, $9,902 USD. If we know each machine
production rate, the previous savings are equivalent to $1,967 USD/day, and considering the
inventory, to $6,075 USD/day.
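The percentages and savings quoted above follow directly from the reported weights, as the short calculation below shows. It uses only figures stated in the text; the variable names are illustrative.

```python
PRODUCTION_KG = 211_317.3
MANUAL_WASTE_KG = 17_628.5
MANUAL_INVENTORY_KG = 16_742.0
SA_WASTE_KG = 9_614.7
COST_USD_PER_KG = 400.0 / 1000.0


def pct(part, whole):
    """Express part as a percentage of whole."""
    return 100.0 * part / whole


manual_total_kg = MANUAL_WASTE_KG + MANUAL_INVENTORY_KG

waste_cut = pct(MANUAL_WASTE_KG - SA_WASTE_KG, MANUAL_WASTE_KG)
waste_cut_with_inventory = pct(manual_total_kg - SA_WASTE_KG, manual_total_kg)
manual_share = pct(MANUAL_WASTE_KG, PRODUCTION_KG)
manual_share_with_inventory = pct(manual_total_kg, PRODUCTION_KG)
sa_share = pct(SA_WASTE_KG, PRODUCTION_KG)
savings_usd = (MANUAL_WASTE_KG - SA_WASTE_KG) * COST_USD_PER_KG
savings_with_inventory_usd = (manual_total_kg - SA_WASTE_KG) * COST_USD_PER_KG

print(f"waste reduction: {waste_cut:.1f}% ({waste_cut_with_inventory:.1f}% with inventory)")
print(f"waste share: {manual_share:.2f}% -> {sa_share:.2f}% "
      f"({manual_share_with_inventory:.2f}% with inventory)")
print(f"savings: ${savings_usd:,.0f} (${savings_with_inventory_usd:,.0f} with inventory)")
```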

The results of the cutting blade movements optimization are shown in Fig. 2. The SA
system generated a 41% decrease in total cutting blade movements compared to the
manual procedure. Fig. 2(a) shows the cutting blade movements of the cutting schedule
resulting from the manual procedure, and Fig. 2(b) shows the same for the SA
system. The latter shows a smoother distribution and smaller movements for the
cutting schedule. Cutting blade movements were on average 48% smaller
than with the manual procedure, due mainly to the lack of optimization of this part in
the manual procedure.

Fig. 2. Cutting blade movements: (a) manual procedure; (b) SA system



Other tests were performed with orders for four different cutting machines (differing
mainly in maximum width) over four months. The average savings in total waste
was 900,000 kg using the SA system, which represented a 22% savings compared to the
manual procedure. This represents $360,000 USD in savings. However, production that
went directly to inventory was 12,000,000 kg, which represents $4,800,000 USD. If we
consider that the SA system does not generate production to inventory (unless it is
specified), the total savings for four months with four cutting machines was $5,160,000 USD.

4 Conclusions 

All results were validated by the people in charge of obtaining the cutting schedule at
the paper manufacturing company. The savings generated by the SA system come not only
from the optimization of the total waste but also from the elimination of production to
inventory. Production to inventory turned out to be the actual problem the company had.
Even though some inventory production could be sold in the future, this production was
generated to minimize the total waste during the generation of the cutting schedule. The
system also allows people from the company to spend more time on decision making,
problem analysis and/or urgent orders, since obtaining a cutting schedule takes about 10
minutes.

Cutting blade movements optimization was not performed in the manual procedure.
The use of the SA system for this problem generates savings due to the increase in
the production rate of a cutting machine. However, we do not include that analysis since
we do not have access to that information.

The authors are currently working on a globally optimal cutting schedule generation
system that will allow the use of several cutting machines (of different widths) to process
an order list and generate the cutting schedule.

Abstract. In this paper we present a technique for the prediction of electrical
demand based on multiple models. The multiple models are composed of
several local models, each one describing a region of behavior of the system,
called an operation regime. The multiple models approach developed in this work
is applied to predict electrical load 24 hours ahead. Electrical load data from
the state of California, covering a period of approximately 2 years, were used as
a case study. The concept of multiple models implemented in the present
work is also characterized by the combination of several techniques. Two
important techniques are applied in the construction of multiple models:
regularization and Knowledge Discovery in Data Bases (KDD) techniques.
KDD is used to identify the operation regimes of the electrical load time series.



1 Introduction 

Depending on the length of the study, forecasting of electrical demand can be divided
into long-term forecasting (5 to 10 years), medium-term forecasting (months to 5
years), and short-term forecasting (hours to months) [13].

Several techniques for load forecasting have been developed over the years (e.g. 
regressive models, stochastic time series, time-space models, expert systems, and 
artificial neural networks, among others). None of them have been able to produce 
satisfactory models for electrical demand forecasting. Short-term demand forecasting 
of electricity depends mainly on the weather, which is stochastic by nature [15]. 

Another factor, occurring in several countries, that makes short-term forecasting
very important is the new trend toward creating electricity markets [13]. This situation is a
natural consequence of the technological development in the electrical industry,
which has reduced production costs and created a new way to produce electricity.
These are the reasons why the electrical market is no longer a monopoly, as it was in
the 1980s. This revolution in the electricity industry aims at a decentralized and more
competitive industry [15]. This scenario has spawned the need for new, better, and
more accurate tools to model and analyze electrical demand. Modeling and analysis
are important in several areas of the electrical industry, in particular in the analysis of
the structure of electrical markets and in forecasting the price of electricity [15].

Fig. 1. Operation Regimes

Fig. 2. Multi-models Conceptual Scheme

This paper presents a methodology to predict electrical demand 24 hours ahead, 
using a technique called multi-models. Multi-models uses data mining to cluster the 
demand data and develops a local model for each cluster. 



2 Operation Regimes and Multi-models 

This section briefly describes the technique used to develop the forecasting model 
proposed in this paper. We call this methodology multi-models, for it combines 
several local models to form the global model of the system. This methodology 
assumes that systems can be decomposed in several regions, called operation regimes, 
exhibiting different behaviors (see Figure 1). 

Figure 2 illustrates the concept of multi-models. In Figure 2, $X^k$ are the known data,
$X^{k+1}$ are the data to predict, and J indicates the operation regime.

Our implementation partitions the time-series data-base in operation regimes using 
a clustering algorithm. Clustering, a concept borrowed from data-mining, is 
performed using AutoClass, a program based on stochastic classification. 

Assuming that the relationship of the electrical load of those days belonging to the
same operation regime is linear, and that the variation of the load characteristics of
adjacent days is not large (we assume that adjacent days belong to the same operation
regime), we apply a linear model for each operation regime. To satisfy the second
assumption, we regularize the time-series using a first order membrane filter in order
to find the load tendency curve. Both assumptions lead to a great simplification in the
construction of the model. The first simplification allows us to use linear models in all
operation regimes; the second one allows us to use the local model for today to
predict tomorrow's behavior.

The main motivation for developing multi-models is the need to find a robust and
precise forecasting method for electrical load. Most forecasting methods, particularly
the statistical and linear ones, are not robust to abrupt system changes, which
indicates that those models work well only within a narrow range of operating
conditions. As a solution to this problem, several models capable of adapting to
abrupt system changes have been developed. Nevertheless, those techniques require
more complex models, which are harder to analyze.

The basic concept of multi-models combines a set of models, where each model
describes a limited part of the system that behaves in a specific way (an operation regime)
[10]. We call each model corresponding to an operation regime a local model, and
the holistic model the global model. The main advantages of the use of multi-models are
(see [9] and [10]):

• Resulting models are simpler, because there are only a few relevant local
phenomena. Modeling a chaotic system with a holistic model can be impossible, or
the accuracy of the prediction of that model would be unacceptable.

• The framework works for any kind of representation used for local models. A 
direct consequence of this is that the framework forms a base for the production of 
hybrid models. 

According to [9] and [10], the multi-model problem can be decomposed into the 
following phases: 

• Decomposition of the global system in operation regimes. 

• Identification of the local model's structures for each operation regime. 

• Identification of the parameters of the local models. 

• Combination of the local models to form the global model (an implicit step). 

Given a time-series data-base of electrical load $X^k$, we estimate the values that
will occur in the future, assuming there exists a relation between $X^{k+1}$ and $X^k$, given in
general by equation 1.

$$X^{k+1} = f(X^k) \qquad (1)$$

Since the system is decomposed into several operation regimes, f is composed
of a set of simpler functions $f_j$, each of them describing a given operation regime.
Formally, this idea can be expressed as in equation 2.

$$f(X^k) = \begin{cases} f_1(X^k) & \text{if } OR = 1 \\ f_2(X^k) & \text{if } OR = 2 \\ \quad \vdots \\ f_J(X^k) & \text{if } OR = J \end{cases} \qquad (2)$$

where j = 1, 2, ..., J are the operation regimes (OR).
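The piecewise definition in equation 2 is essentially a lookup of the local model by operation regime. The sketch below illustrates that idea with two hypothetical linear local models; the coefficients and function names are invented for the example and do not come from the paper.

```python
def make_linear_model(slope, intercept):
    """Build a local model f_j that maps today's load profile to tomorrow's."""
    def model(x_k):
        return [slope * value + intercept for value in x_k]
    return model


def predict_next(x_k, regime, local_models):
    """Global model f: apply the local model of the current operation regime."""
    return local_models[regime](x_k)


# Hypothetical local models for two operation regimes.
local_models = {
    1: make_linear_model(1.0, 0.0),    # regime 1: tomorrow looks like today
    2: make_linear_model(0.5, 50.0),   # regime 2: damped load with an offset
}

today = [100.0, 200.0]
print(predict_next(today, 1, local_models))
print(predict_next(today, 2, local_models))
```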
3 Regularization



In this section we describe an adaptable filtering technique that allows us to estimate
the low frequency components of the time-series. This technique, called regularization,
is used to find the trend in the electrical load. That is, it decomposes the time-series into
two components: a tendency signal and a signal without a tendency.

The regularized signal Z is obtained by minimizing E(Z), the energy needed to
approximate the values of Z to h (see equation 3).

$$E(Z) = \sum_{k=1}^{N} (h^k - Z^k)^2 + \lambda \sum_{k=1}^{N-1} (Z^{k+1} - Z^k)^2 \qquad (3)$$

where N is the sample size, h is the observed signal and Z is the signal to be
estimated (the trend signal). The first term of equation 3 represents the data
constraint, specifying that the smooth signal $Z^k$ must not be too different from the
observed signal $h^k$. The second term represents the smoothing constraint, stating that
neighboring values must be similar; if $\lambda$ (the regularization constant) is large, $Z^k$ will
be very smooth, and therefore very different from $h^k$.

The smoothing constraint from equation 3 is an approximation of the gradient 
magnitude: 




By algebraic manipulation, setting the derivative of E(Z) with respect to each $Z^k$ to
zero, we can obtain the values of $Z^k$ that minimize the energy E(Z); solving for $h^k$ we have

$$h^k = -\lambda Z^{k-1} + (1 + 2\lambda) Z^k - \lambda Z^{k+1} \qquad (5)$$

The trend curve is given by the values of vector Z, computed using equation 5
expressed in matrix form:

$$Z = M^{-1} h \qquad (6)$$

Note that M is a large sparse matrix. We apply the Fourier transform to both
members of equation 5, yielding equation 7.

$$H(w) = Z(w) \left[ 1 + \lambda \left( 2 - e^{-iw} - e^{iw} \right) \right] \qquad (7)$$

Simplifying,

$$\frac{Z(w)}{H(w)} = \frac{1}{1 + 2\lambda (1 - \cos(w))} \qquad (8)$$

$$Z(w) = F(w) H(w) = \frac{1}{1 + 2\lambda (1 - \cos(w))} H(w) \qquad (9)$$

Equation 8 is the frequency response of the first order membrane filter, and
equation 9 represents the convolution between signal h and the membrane filter.
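In the time domain, the filter can be applied by solving the tridiagonal system that results from minimizing the energy. The sketch below uses the Thomas algorithm in plain Python and is only one possible realization, not the authors' code; the free-end boundary rows (diagonal 1 + lambda) follow from the energy having N - 1 smoothing terms, and the function name is hypothetical.

```python
def membrane_filter(h, lam):
    """Return the trend Z that solves M Z = h for the first order membrane filter.

    Interior rows: -lam * Z[k-1] + (1 + 2*lam) * Z[k] - lam * Z[k+1] = h[k].
    End rows use the diagonal 1 + lam (free ends).
    """
    n = len(h)
    if n == 0:
        return []
    if n == 1:
        return [float(h[0])]
    diag = [1.0 + 2.0 * lam] * n
    diag[0] = diag[-1] = 1.0 + lam
    off = -lam  # value of both the sub- and super-diagonal
    # Forward sweep of the Thomas algorithm.
    c_prime = [0.0] * n
    d_prime = [0.0] * n
    c_prime[0] = off / diag[0]
    d_prime[0] = h[0] / diag[0]
    for k in range(1, n):
        denom = diag[k] - off * c_prime[k - 1]
        c_prime[k] = off / denom
        d_prime[k] = (h[k] - off * d_prime[k - 1]) / denom
    # Back substitution.
    z = [0.0] * n
    z[-1] = d_prime[-1]
    for k in range(n - 2, -1, -1):
        z[k] = d_prime[k] - c_prime[k] * z[k + 1]
    return z


load = [1.0, 4.0, 2.0, 8.0, 5.0]
trend = membrane_filter(load, lam=2.0)
detrended = [h_k - z_k for h_k, z_k in zip(load, trend)]
print(trend)
print(detrended)
```

Solving the tridiagonal system directly takes time linear in the number of samples, which is a practical alternative to the Fourier-domain form for long hourly series.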
4 Clustering



Operation regimes are detected using a clustering algorithm, taken from the field of
Knowledge Discovery and Data-Mining (KDD) (for a detailed description of the
KDD process, see [1, 5, 8]). In this phase we used clustering as our data-mining
algorithm. In general terms, clustering partitions a group of objects into subgroups,
where the elements of each subgroup are similar to one another and dissimilar
to members of other subgroups [3, 5, 15]. The concept of similarity
may take different meanings; for instance, shape, color, size, etc. (see [3, 14]). In
many algorithms and applications, similarity is defined in terms of metric distances
[6].

Clustering is a form of unsupervised learning, where the goal is to find structures
in the data in the form of natural groups. This process requires data to be defined in
terms of attributes relevant to the characteristics we want the classification to be based
upon.

An example of an application that performs clustering is AutoClass. AutoClass is a
clustering algorithm based on Bayesian theory, using the classical finite mixture
distribution model (see [5]). This model states that each instance belongs to one
and only one unknown class from a set of J classes, with probability given by
equation 10.

$$p(X_i \in C_j \mid PI) \qquad (10)$$

where $X_i$ is an instance, $C_j$ is class j, and PI is the a priori information including the
class distribution model, the set of parameters of the distribution model, and the
available search space.

Assuming the number of classes is known, AutoClass tries to maximize the a
posteriori likelihood of the class partition model. With no information about the class
membership, the program uses a variation of the Expectation Maximization (EM)
algorithm [12] to approximate the solution to the problem. Equation 11 defines the
weights $w_{ij}$ used to compute the statistical values of the unknown class.

$$w_{ij} = \frac{p(X_i \mid X_i \in C_j, PI) \, p(X_i \in C_j \mid PI)}{\sum_{l=1}^{J} p(X_i \mid X_i \in C_l, PI) \, p(X_i \in C_l \mid PI)} \qquad (11)$$

where

$$\alpha_j = p(X_i \in C_j \mid PI), \qquad 0 < \alpha_j < 1, \qquad \sum_{j=1}^{J} \alpha_j = 1 \qquad (12)$$

Equation 12 describes the statistical values of a normal distribution (number of
classes, mean, and variance).

Using the statistical values computed from equations 12 and 13, we can estimate
the weights. We repeat the process until convergence leads to the proper estimation of
the parameters. The probability density function with the highest a posteriori likelihood
is the final result [5, 11].
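The alternation between computing class weights and re-estimating class statistics can be illustrated with a plain EM loop for a one-dimensional Gaussian mixture. This is a generic textbook sketch, not AutoClass itself (which additionally searches over models and uses Bayesian priors); all function names, the variance floor and the sample data are assumptions made for the example.

```python
import math


def normal_pdf(x, mean, variance):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2 * variance)) / math.sqrt(2 * math.pi * variance)


def class_weights(x, alphas, means, variances):
    """E-step: posterior weight of each class for instance x."""
    joint = [a * normal_pdf(x, m, v) for a, m, v in zip(alphas, means, variances)]
    total = sum(joint)
    return [p / total for p in joint]


def update_parameters(xs, weights, min_variance=1e-6):
    """M-step: weighted class proportions, means and variances."""
    alphas, means, variances = [], [], []
    for j in range(len(weights[0])):
        w_j = [w[j] for w in weights]
        total = sum(w_j)
        mean = sum(w * x for w, x in zip(w_j, xs)) / total
        var = sum(w * (x - mean) ** 2 for w, x in zip(w_j, xs)) / total
        alphas.append(total / len(xs))
        means.append(mean)
        variances.append(max(var, min_variance))  # floor avoids a degenerate class
    return alphas, means, variances


xs = [1.0, 1.2, 0.8, 5.0, 5.3, 4.7]
alphas, means, variances = [0.5, 0.5], [0.0, 6.0], [1.0, 1.0]
for _ in range(20):
    weights = [class_weights(x, alphas, means, variances) for x in xs]
    alphas, means, variances = update_parameters(xs, weights)
print(alphas, means, variances)
```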
Fig. 3. Regularization: (a) original data; (b) regularized data

Fig. 4. Classes obtained using AutoClass



5 Case Study

For our case study we took the data-base of the California ISO [2]; this data-base
contains hourly measurements from September 1, 2000 to October 14, 2002, for a
total of 18,591 measurements. We processed and transformed the time-series into a
matrix of 774 rows (days) and 3 columns (maximum, minimum, and average values).
We took 744 days as the training set to predict day 745 (September 15, 2002).
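The transformation from hourly measurements to a day-by-attribute matrix can be sketched as follows. This is an illustrative helper with a hypothetical name; it assumes 24 measurements per day and simply drops an incomplete trailing day, which may differ from how the authors handled missing hours.

```python
def daily_features(hourly_load, hours_per_day=24):
    """Return one (maximum, minimum, average) row per complete day."""
    n_days = len(hourly_load) // hours_per_day
    rows = []
    for day in range(n_days):
        chunk = hourly_load[day * hours_per_day:(day + 1) * hours_per_day]
        rows.append((max(chunk), min(chunk), sum(chunk) / len(chunk)))
    return rows


# Two synthetic days of hourly load, plus three extra hours that are ignored.
hourly = [float(h) for h in range(24)] + [100.0] * 24 + [7.0, 8.0, 9.0]
features = daily_features(hourly)
print(features)
```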

^ ^ / 12 ( 13 ) 

-=1 ^2 



The training set was subject to the following data-mining and forecasting process. 

• Errors were corrected, missing and bad data were fixed. 

• The time-series was regularized using the first order membrane filter, 
decomposing the time-series in trend and regularized data. Figure 3 shows the 
original data, the regularized data and the trend. 

• The trend was predicted for day 745 using multiple-linear regression. 

Using AutoClass, we found that day 745 belongs to class 1. Figure 4 shows the
classification produced by AutoClass. Data mining is an iterative process; we are
showing here, of course, only the last iteration of the classification process. In this
process, data were exposed to AutoClass using several sets of attributes; for instance, the
Fourier components of each day, up to the fifth harmonic. None of them worked as
well as the final one, where the attributes were the maximum, minimum, and average
values for each day.

Fig. 5. Real forecast for day 745 and absolute error

Fig. 6. Real and estimated forecast between 09/15/2002 and 09/24/2002

Table 1. Statistical parameters for the error

Parameter   σ      ε̄ (%)   ρ        e (%)   Σε
Value       2.31   1.78    0.9904   10.17   1227

Table 2. Statistical parameters for the error in the California model

Parameter          σ      ε̄ (%)   ρ        e (%)   Σε
Multi-models       2.31   1.78    0.9904   10.17   1227
California Model   2.59   2.38    0.9888   11.78   1623.3

Using linear regression, we forecasted the regularized load. We then added the
predicted trend and the predicted regularized load for day 745, obtaining the real
forecast, presented in Figure 5.

Using the same methodology, we forecasted the load for days 746 to 774. Figure 6
shows the results for the first 10 forecasted days, that is, from September 15 to
September 24, 2002.

Table 1 shows statistical parameters for the error estimation using multi-models in
the period 09/15/2002 to 10/14/2002, where σ is the standard deviation, ε̄ the mean
absolute error, ρ the correlation coefficient between real and estimated load, e the
maximum absolute error, and Σε the accumulated error in the same period (that is,
the area integral of the load function during a day).
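Error statistics of this kind can be computed along the following lines. The paper does not spell out the exact definitions, so this sketch makes explicit assumptions: errors are expressed as percentages of the real load, the standard deviation is taken over the signed percentage errors, and the accumulated error is the sum of absolute differences in load units. The function name and the sample values are hypothetical.

```python
import math


def error_statistics(real, estimated):
    """Summarize forecast errors between two equally long load series."""
    n = len(real)
    pct = [100.0 * (e - r) / r for r, e in zip(real, estimated)]
    abs_pct = [abs(p) for p in pct]
    mean_pct = sum(pct) / n
    sigma = math.sqrt(sum((p - mean_pct) ** 2 for p in pct) / n)
    mean_r = sum(real) / n
    mean_e = sum(estimated) / n
    cov = sum((r - mean_r) * (e - mean_e) for r, e in zip(real, estimated))
    spread = math.sqrt(
        sum((r - mean_r) ** 2 for r in real) * sum((e - mean_e) ** 2 for e in estimated)
    )
    return {
        "sigma": sigma,                       # standard deviation of the % error
        "mean_abs_pct": sum(abs_pct) / n,     # mean absolute error (%)
        "rho": cov / spread,                  # correlation between real and estimated
        "max_abs_pct": max(abs_pct),          # maximum absolute error (%)
        "accumulated": sum(abs(e - r) for r, e in zip(real, estimated)),
    }


stats = error_statistics([100.0, 200.0, 400.0], [110.0, 190.0, 400.0])
print(stats)
```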

Figure 7 is the forecasting error histogram for a 30 day period. This histogram 
shows that the most frequent errors fall in the interval from 0 to 4%. 

Our results were compared with those provided by the California ISO predictor.
Table 2 shows the values of statistical parameters for the error estimation for the
California model, Figure 8 shows a comparison between the predictions of both
models, and Figure 9 shows the error histogram for the California model.

Fig. 7. Forecasting Error Histogram for a 30 Day Period

Fig. 8. Comparing Predictions of Multi-models and the California Model

Fig. 9. Forecasting Error Histogram for a 30 Day Period for the California Model



6 Conclusions 

In this work we present a technique we call multi-models. Using multi-models, we
successfully forecasted electrical demand 24 hours ahead on the California ISO
data. That forecasting was done without any weather information. The
conceptual framework for developing multi-models is very general, and allows us to
implement it using mixed techniques for the local models. In our implementation, we
used least squares on linear models.

In order to be able to apply the concept of multi-models we needed to pre-process
the time-series. First we regularized it, using a first-order membrane filter. This
regularization removed any trend included in the signal, which was accounted for at
the final step. After regularization, we applied clustering to classify the days
according to their features. For each class (operation regime) a separate linear model
was produced. The classification was done using AutoClass, a probabilistic clustering
tool. At the end, we put together the forecast, including the trend and the prediction
made via a local model.

The results obtained were comparable with the model developed by the California 
ISO, and better in many situations. 

Our next step is to drop the assumption that contiguous days are in the same regime. This
assumption leads to the paradoxical situation that all days must be in the same regime.
Instead, we will model the regime series, perhaps using a neural network, predicting
which model to use for the next day, which should yield even more accurate results.

Abstract. A new method for extracting valuable process information
from input-output data is presented in this paper, using a pseudo-gaussian
basis function neural network with regression weights. The proposed
methodology produces a dynamical radial basis function network, able to modify
the number of neurons within the hidden layer. Another important characteristic
of the proposed neural system is that the activation of the
hidden neurons is normalized, which, as described in the bibliography,
provides better performance than non-normalization. The effectiveness of
the method is illustrated through the development of dynamical models
for a very well known benchmark, the synthetic Mackey-Glass time series.
1 Introduction 

RBF networks form a special neural network architecture, which consists of three 
layers, namely the input, hidden and output layers. The input layer is only used 
to connect the network to its environment. Each node in the hidden layer has 
associated a centre, which is a vector with dimension equal to that of the network 
input data. Finally, the output layer is linear and serves as a summation unit: 

K 

FRBFjXn) = y^^Wi(j)^ {Xn,Ci,ai) (1) 

i=l 

where the radial basis functions (pi are nonlinear functions, usually gaussian 
functions [9]. An alternative is to calculate the weighted average F^bf 
radial basis function with the addition of lateral connections between the radial 
neurons. In normalized RBF neural networks, the output activity is normalized 
by the total input activity in the hidden layer. 

The use of the second method has been presented in different studies as 
an approach which, due to its normalization properties, is very convenient and 
provides better performance than the weighted sum method for function 
approximation problems. In terms of smoothness, the weighted average provides 
better performance than the weighted sum [8,3]. 

Assuming that training data $(x_i, y_i)$, $i = 1, 2, \ldots, N$, are available 
and have to be approximated, the RBF network training problem can be formulated 
as an optimization problem, where the normalized root mean squared error (NRMSE) 
between the true outputs and the network predictions must be minimized with 
respect to both the network structure (the number of nodes K in the hidden 
layer) and the network parameters (centres, sigmas and output weights): 

\[ NRMSE = \sqrt{\frac{\overline{e^{2}}}{\sigma_{y}^{2}}} \qquad (2) \]

where $\sigma_{y}^{2}$ is the variance of the output data, and $\overline{e^{2}}$ 
is the mean-square error between the obtained and the desired output. Developing 
a single procedure that minimizes the above error, taking into account both the 
structure and the parameters that define the system, is rather difficult using 
traditional optimization techniques. Most approaches presented in the literature 
consider a fixed RBF network structure and decompose the optimization of the 
parameters into two steps: in the first step the centres of the nodes are 
obtained (different paradigms can be used, such as clustering techniques, genetic 
algorithms, etc.), and in the second step the connection weights are calculated 
using simple linear regression. Finally, a sequential learning algorithm is 
presented to adapt the structure of the network, in which it is possible to 
create new hidden units and also to detect and remove inactive units. 
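
As a small illustration of the error measure described above (mean-square error divided by the variance of the output data, under a square root), the following sketch computes an NRMSE with NumPy. The function name and the use of the population variance are assumptions.

```python
import numpy as np

def nrmse(y_true, y_pred):
    """Normalized root mean squared error: sqrt(MSE / variance of the targets)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mse = np.mean((y_true - y_pred) ** 2)
    return float(np.sqrt(mse / np.var(y_true)))
```

Under this definition a perfect predictor scores 0, and a predictor that always returns the mean of the targets scores 1, which makes values from different data sets comparable.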

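In this paper we propose to use a pseudo-gaussian function for the nonlinear 
function within the hidden unit. The output of a hidden neuron is computed as: 

\[ \phi_i(x) = \prod_{v=1}^{D} \phi_{i,v}(x^v), \qquad
\phi_{i,v}(x^v) = \begin{cases}
\exp\left(-\dfrac{(x^v - c_i^v)^2}{\sigma_{i,-}^v}\right), & -\infty < x^v \le c_i^v \\
\exp\left(-\dfrac{(x^v - c_i^v)^2}{\sigma_{i,+}^v}\right), & c_i^v < x^v < \infty
\end{cases} \qquad (3) \]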



The index i runs over the number of neurons (K) while v runs over the dimension 
of the input space ($v \in [1, D]$). The behaviour of classical gaussian functions 
and the new PG-RBF in two dimensions is illustrated in Fig. 1 and Fig. 2. 

The weights connecting the activation of the hidden units with the output of 
the neural system, instead of being single parameters, are functions of the input 
variables. Therefore, the $w_i$ are given by: 

\[ w_i = \sum_{v=1}^{D} b_i^v x^v + b_i^0 \qquad (4) \]

where the $b_i^v$ are single parameters. 
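
The following sketch puts the pieces of this section together: an asymmetric (pseudo-gaussian) activation with separate left and right widths per dimension, input-dependent regression weights, and a normalized output. It is only an illustration under stated assumptions: the exponent is taken to be the squared distance divided by the one-sided width, and all function and argument names are hypothetical.

```python
import numpy as np

def pg_activation(x, centre, sigma_minus, sigma_plus):
    """Pseudo-gaussian activation of one neuron for an input x of shape (D,)."""
    # In each dimension, use the left width when x lies at or below the
    # centre and the right width otherwise (assumed exponent: d^2 / width).
    width = np.where(x <= centre, sigma_minus, sigma_plus)
    return float(np.prod(np.exp(-(x - centre) ** 2 / width)))

def regression_weight(x, b):
    """Regression weight: b[0] is the bias term, b[1:] multiply the inputs."""
    return float(b[0] + b[1:] @ x)

def pg_rbf_output(x, centres, sig_minus, sig_plus, B):
    """Normalized network output: weighted average of the regression weights."""
    phi = np.array([pg_activation(x, c, sm, sp)
                    for c, sm, sp in zip(centres, sig_minus, sig_plus)])
    w = np.array([regression_weight(x, b) for b in B])
    return float(w @ phi / phi.sum())
```

With a single neuron, the normalized output reduces to that neuron's regression weight, so the network behaves like a local linear model around each centre.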

The behaviour of the new PGBF in two dimensions is illustrated in Fig. 1. 
The structure of the proposed neural system is therefore modified using a 
pseudo-gaussian function (PG) in which two scaling parameters $\sigma$ are 
introduced, which eliminate the symmetry restriction and provide the neurons in 
the hidden layer with greater flexibility for function approximation. Other 
important characteristics of the proposed neural system are that the activation 
of the hidden neurons is normalized and that, instead of using a single parameter 
for the output weights, these are functions of the input variables, which leads 
to a significant reduction in the number of hidden units compared with the 
classical RBF network. 

Fig. 1. 3-D behaviour of a pseudo-gaussian function for two inputs 

Fig. 2. Contour of a pseudo-gaussian function for two inputs 

2 Sequential Learning Using PGBF Network 

Learning in the PGBF network consists in determining the minimum necessary 
number of neuron units and in adjusting the parameters of each individual hidden 
neuron, given a set of data $(x_n, y_n)$ [5]. The sequential learning algorithm 
starts with only one hidden node and creates additional neurons based on the 
novelty (innovation) in the observations, which arrive sequentially. The decision 
as to whether a datum should be deemed novel is based on the following conditions: 

\[ e_n = |y_n - F^{*}_{RBF}(x_n)| > \xi, \qquad
\beta_{max} = \max_i \phi_i(x_n) < \zeta \qquad (5) \]

If both conditions are satisfied, then the datum is considered to be novel and 
therefore a new hidden neuron is added to the network. This process continues 
until a maximum number of hidden neurons is reached. The parameters $\xi$ and 
$\zeta$ are thresholds to be selected appropriately for each problem. The first 
condition states that the error must be significant and the second deals with the 
activation of the nonlinear neurons. The parameters of the new hidden node are 
determined initially as follows: 

\[ \begin{aligned}
K &= K + 1 \\
b_K^v &= \begin{cases} y_n - F^{*}_{RBF}(x_n) & \text{if } v = 0 \\ 0 & \text{otherwise} \end{cases} \\
c_K &= x_n \quad (c_K^v = x_n^v, \ \forall v \in [1, D]) \\
\sigma_{K,+}^v &= \sigma_{K,-}^v = \gamma \, \sigma_{init} \min_{i=1,\ldots,K-1} \|x_n - c_i\|
\end{aligned} \qquad (6) \]

where $\gamma$ is an overlap factor that determines the amount of overlap between 
the data considered as novel and the nearest centre of a neuron. If an observation 
has no novelty, then the existing parameters of the network are adjusted by a 
gradient descent algorithm to fit that observation. We propose a pruning strategy 
that can detect and remove hidden neurons which, although active initially, may 
subsequently end up contributing little to the network output. A more streamlined 
neural network can then be constructed as learning progresses. For this purpose, 
three cases will be considered: 



(a) Pruning the hidden units that make very little contribution to the overall 
network output for the whole data set. Pruning removes a hidden unit i when: 

\[ \theta_i = \sum_{n=1}^{N} \phi_i(x_n) < \chi_1 \]

where $\chi_1$ is a threshold. 

(b) Pruning hidden units which have a very small activation region. These 
units evidently represent overtrained learning. A neuron i having very 
low values of $\sigma_{i,+}^v + \sigma_{i,-}^v$ in the different dimensions of 
the input space will be removed: 

\[ \sum_{v} \left( \sigma_{i,+}^v + \sigma_{i,-}^v \right) < \chi_2 \]

(c) Pruning hidden units which have a very similar activation to other neurons 
in the neural system. To achieve this, we define the vectors $\psi_i$, 
where N is the number of input/output vectors presented, such that 
$\psi_i = [\phi_i(x_1), \phi_i(x_2), \ldots, \phi_i(x_N)]$. As a guide to determine 
when two neurons present similar behaviour, this can be expressed in terms of the 
inner product $\psi_i \cdot \psi_j < \chi_3$. If the inner product is near one then 
$\psi_i$ and $\psi_j$ are both attempting to do nearly the same job (they possess 
a very similar activation level for the same input values). In this case, they 
directly compete in the sense that only one of these neurons is selected and 
therefore the other one is removed. 



If any of these conditions are fulfilled for a particular neuron, the neuron is 
automatically removed. 
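
The three pruning cases can be sketched as a single function that returns which hidden units to remove. This is an illustration, not the authors' implementation: the array layout and names are hypothetical, and case (c) is implemented here as a cosine similarity above a threshold, following the description that an inner product near one marks two neurons doing the same job (the printed condition uses the opposite inequality sign).

```python
import numpy as np

def prune_mask(Phi, sigma_minus, sigma_plus, chi1, chi2, chi3):
    """Boolean mask of hidden units to remove.

    Phi has shape (N, K): activation of unit i for pattern n.
    sigma_minus and sigma_plus have shape (K, D): one-sided widths.
    """
    K = Phi.shape[1]
    remove = np.zeros(K, dtype=bool)
    # (a) negligible total activation over the whole data set
    remove |= Phi.sum(axis=0) < chi1
    # (b) very small activation region, summed over input dimensions
    remove |= (sigma_minus + sigma_plus).sum(axis=1) < chi2
    # (c) near-duplicate activation patterns (assumed: cosine similarity)
    norms = np.linalg.norm(Phi, axis=0)
    unit = Phi / np.where(norms > 0, norms, 1.0)
    sim = unit.T @ unit
    for i in range(K):
        for j in range(i + 1, K):
            if not remove[i] and not remove[j] and sim[i, j] > chi3:
                remove[j] = True  # keep unit i, drop the later duplicate
    return remove
```

Checking case (c) only between units that survive the first two tests avoids removing both members of a duplicate pair.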
The final algorithm is summarized below: 

Step 1: Initially, no hidden neurons exist. 

Step 2: Set n = 0, K = 0 and h = 1, where n, K and h are the number of patterns 
presented to the network, the number of hidden neurons and the number of 
learning cycles, respectively. Set the effective radius. Set the maximum number 
of hidden neurons, MaxNeuron. 

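Step 3: For each observation $(x_n, y_n)$ compute: 

a) the overall network output: 

\[ F^{*}_{RBF}(x_n) = \frac{\sum_{i=1}^{K} w_i \, \phi_i(x_n, c_i, \sigma_i)}{\sum_{i=1}^{K} \phi_i(x_n, c_i, \sigma_i)} = \frac{Num}{Den} \qquad (7) \]

b) the parameters required for the evaluation of the novelty of the observation: 
the error $e_n = |y_n - F^{*}_{RBF}(x_n)|$ and the maximum degree of activation 
$\beta_{max}$. If $(e_n > \xi)$ and $(\beta_{max} < \zeta)$ and $(K < MaxNeuron)$, 
allocate a new hidden unit with parameters: 

\[ \begin{aligned}
K &= K + 1 \\
b_K^v &= \begin{cases} y_n - F^{*}_{RBF}(x_n) & \text{if } v = 0 \\ 0 & \text{otherwise} \end{cases} \\
c_K &= x_n \quad (c_K^v = x_n^v, \ \forall v \in [1, D]) \\
\sigma_{K,+}^v &= \sigma_{K,-}^v = \gamma \, \sigma_{init} \min_{i=1,\ldots,K-1} \|x_n - c_i\|
\end{aligned} \qquad (8) \]

else apply the parameter learning for all the hidden nodes: 

\[ \Delta c_i^v = -\frac{\partial E}{\partial c_i^v}
= -\frac{\partial E}{\partial F^{*}_{RBF}} \, \frac{\partial F^{*}_{RBF}}{\partial \phi_i} \, \frac{\partial \phi_i}{\partial c_i^v} \qquad (9) \]

\[ \Delta \sigma_{i,\pm}^v = -\frac{\partial E}{\partial \sigma_{i,\pm}^v}
= -\frac{\partial E}{\partial F^{*}_{RBF}} \, \frac{\partial F^{*}_{RBF}}{\partial \phi_i} \, \frac{\partial \phi_i}{\partial \sigma_{i,\pm}^v} \qquad (10) \]

\[ \Delta b_i^v = -\frac{\partial E}{\partial b_i^v}
= -\frac{\partial E}{\partial F^{*}_{RBF}} \, \frac{\partial F^{*}_{RBF}}{\partial Num} \, \frac{\partial Num}{\partial w_i} \, \frac{\partial w_i}{\partial b_i^v} \qquad (11) \]

Each update is proportional to the error $(y_n - F^{*}_{RBF}(x_n))$. The derivatives 
with respect to $c_i^v$ and $\sigma_{i,\pm}^v$ are evaluated on the side of the 
centre on which $x^v$ falls, selected by the indicator functions 
$U(x^v; -\infty, c_i^v)$ and $U(x^v; c_i^v, \infty)$. 

Step 4: If all the training patterns have been presented, then increment the 
number of learning cycles (h = h + 1), and check the criteria for pruning hidden 
units: 

\[ \sum_{n=1}^{N} \phi_i(x_n) < \chi_1, \qquad
\sum_{v} \left( \sigma_{i,+}^v + \sigma_{i,-}^v \right) < \chi_2, \qquad
\psi_i \cdot \psi_j < \chi_3, \ \forall j \ne i \qquad (12) \]

Step 5: If the network shows satisfactory performance ($NRMSE < \pi^{*}$) then 
stop. Otherwise go to Step 3. 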
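
The allocation branch of Step 3 can be sketched as two small helpers: a novelty test on the error and the maximum activation, and the initialisation of a new unit centred on the current input. Names are hypothetical, the arguments xi, zeta, gamma and sigma_init stand for the error threshold, the activation threshold, the overlap factor and the initial width, and the fallback used when no centre exists yet is an assumption.

```python
import numpy as np

def is_novel(error, phi, xi, zeta, n_hidden, max_neurons):
    """Allocation test: large error, weak best activation, room to grow."""
    beta_max = float(np.max(phi)) if len(phi) else 0.0
    return abs(error) > xi and beta_max < zeta and n_hidden < max_neurons

def allocate_neuron(x, error, centres, gamma, sigma_init):
    """Initial parameters of a new hidden unit centred on the input x."""
    D = x.shape[0]
    b = np.zeros(D + 1)
    b[0] = error  # only the bias term absorbs the current error
    if len(centres):
        nearest = min(np.linalg.norm(x - c) for c in centres)
    else:
        nearest = 1.0  # assumption: no existing centre to measure from
    width = gamma * sigma_init * nearest
    return {"centre": x.copy(),
            "sigma_minus": np.full(D, width),
            "sigma_plus": np.full(D, width),
            "b": b}
```

When `is_novel` returns false, the algorithm instead takes the gradient-descent branch and adjusts the existing centres, widths and regression coefficients.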



3 Using GA to Tune the Free Parameters of the 
Sequential Learning Algorithms 

Stochastic algorithms, such as simulated annealing (SA) or genetic algorithms 
(GA), are increasingly used for combinatorial optimization problems in diverse 
fields, and particularly in time series [6]. The main advantage of GA over SA is 
that it works on a set of potential solutions instead of a single one; however, in 
particular applications, the major drawback lies in the difficulty of carrying 
out the crossover operator so as to generate feasible solutions with respect to 
the problem constraints. Since this difficulty was not encountered in this study, 
a GA [2,4] has been retained. 

Genetic algorithms are search methods based upon the biological principles 
of natural selection and survival of the fittest introduced by Charles Darwin 
in his seminal work “The Origin of Species” (1859). They were rigorously 
introduced by [2]. GAs consist of a population of individuals that are possible 
solutions, and each one of these individuals receives a reward, known as 
“fitness”, that quantifies its suitability to solve the problem. In ordinary 
applications, fitness is simply the objective function. Individuals with better 
than average fitness receive greater opportunities to cross. On the other hand, 
low-fitness individuals have less chance to reproduce until they become extinct. 
Consequently, the good features of the best individuals are disseminated over the 
generations. In other words, the most promising areas of the search space are 
explored, making the GA converge to the optimal or near-optimal solution. 

The ‘reproduction’ process by which the new individuals are derived consists 
in taking the chromosomes of the parents and subjecting them to crossover and 
mutation operations. The symbols (genes) from parents are combined into new 
chromosomes and afterwards, randomly selected symbols from these new chro- 
mosomes are altered in a simulation of the genetic recombination and mutation 
process of nature. The key ideas are thus the concept of a population of indi- 
vidual solutions being processed together and symbol configurations conferring 
greater fitness being combined in the ‘offspring’, in the hope of producing even 
better solutions. 

As far as this paper is concerned, the main advantages of a GA strategy lie in: 

1. The increased likelihood of finding the global minimum in a situation 
where local minima may abound. 

2. The flexibility of the approach, whereby the search for better solutions 
can be tailored to the problem in hand by, for example, choosing the genetic 
representation to suit the nature of the function being optimized. 

In the sequential learning algorithm proposed in Section 2, there exist 
different parameters that should be tuned in order to obtain an optimal solution. 
These parameters are: $\zeta$, $\chi_1$, $\chi_2$, $\chi_3$. One possibility is to 
tune them by trial and error; the second possibility is to use the GA as an 
optimization tool that must decide the best values for these parameters. The 
relations between the different paradigms are described in Figure 3: 



[Figure 3 blocks: Genetic Algorithm; Time Series Prediction] 

Fig. 3. Block diagram of the different paradigms used in the complete algorithm 
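
The tuning loop of Fig. 3 can be sketched as a small real-coded genetic algorithm that searches for threshold values minimising a fitness function. This is an illustrative sketch only: the operators (truncation selection, one-point crossover, uniform mutation, elitism), the default settings and all names are assumptions, and in the setting of this paper the fitness would be the prediction error obtained after running the sequential learning algorithm with the candidate thresholds.

```python
import random

def genetic_search(fitness, bounds, pop_size=20, generations=40,
                   mutation_rate=0.2, seed=0):
    """Minimise fitness(params) inside box bounds with a simple GA."""
    rng = random.Random(seed)

    def random_individual():
        return [rng.uniform(lo, hi) for lo, hi in bounds]

    population = [random_individual() for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(population, key=fitness)
        parents = ranked[: pop_size // 2]        # fitter half may reproduce
        children = [ranked[0][:]]                # elitism: keep the best
        while len(children) < pop_size:
            mum, dad = rng.sample(parents, 2)
            cut = rng.randrange(1, len(bounds)) if len(bounds) > 1 else 0
            child = mum[:cut] + dad[cut:]        # one-point crossover
            for g, (lo, hi) in enumerate(bounds):
                if rng.random() < mutation_rate:  # uniform mutation
                    child[g] = rng.uniform(lo, hi)
            children.append(child)
        population = children
    return min(population, key=fitness)
```

Because the best individual is copied unchanged into each new generation, the best fitness found never gets worse as the search proceeds.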



4 Application to Time Series Prediction 



In this section we attempt a short-term prediction of the Mackey-Glass time 
series by means of the algorithm presented in the preceding sections. The 
Mackey-Glass chaotic time series is generated from the following delay 
differential equation: 

\[ \frac{dx(t)}{dt} = \frac{a \, x(t - \tau)}{1 + x^{10}(t - \tau)} - b \, x(t) \qquad (13) \]

When $\tau > 17$, the equation shows chaotic behaviour. Higher values of $\tau$ 
yield higher dimensional chaos. To make the comparisons with earlier work fair, we 
chose the parameters of n = 4 and P = 6. 
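
A minimal sketch of how such a benchmark data set might be produced is given below. Everything numerical here is an assumption for illustration: the coefficients a = 0.2 and b = 0.1 and the exponent 10 are the values commonly used for this benchmark, the integration is a plain Euler scheme with unit step and constant initial history, and the input layout (four past values spaced by the prediction step) is one common choice that the text does not confirm.

```python
import numpy as np

def mackey_glass(n_samples, tau=17, a=0.2, b=0.1, dt=1.0, x0=1.2):
    """Euler integration of dx/dt = a*x(t-tau)/(1 + x(t-tau)**10) - b*x(t)."""
    delay = int(round(tau / dt))
    x = np.full(n_samples + delay, x0)       # constant history before t = 0
    for t in range(delay, n_samples + delay - 1):
        x_tau = x[t - delay]
        x[t + 1] = x[t] + dt * (a * x_tau / (1.0 + x_tau ** 10) - b * x[t])
    return x[delay:]

def make_patterns(series, n_inputs=4, step=6):
    """Build inputs [x(t-18), x(t-12), x(t-6), x(t)] and targets x(t+6)
    for the default arguments (assumed layout)."""
    span = (n_inputs - 1) * step
    X, y = [], []
    for t in range(span, len(series) - step):
        X.append([series[t - k * step] for k in range(n_inputs - 1, -1, -1)])
        y.append(series[t + step])
    return np.array(X), np.array(y)
```

A finer integration step or a higher-order scheme would give a more accurate trajectory; the crude scheme is kept here only to make the data flow easy to follow.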
Fig. 4. 3-D Prediction step = 6 and number of neurons = 12. (a) Original and 
predicted Mackey-Glass time series (desired: solid line; predicted: dashed line), 
which are indistinguishable, (b) prediction error 



Table 1. Comparison of the prediction error of different methods for a prediction 
step equal to 6 (500 training data). 

Method                                                  Prediction Error (RMSE) 
Auto Regressive Model                                   0.19 
Cascade Correlation NN                                  0.06 
Back-Prop. NN                                           0.02 
6th-order Polynomial                                    0.04 
Linear Predictive Method                                0.55 
Kim and Kim (Genetic Algorithm and Fuzzy System) [6] 
    5 MFs                                               0.049206 
    7 MFs                                               0.042275 
    9 MFs                                               0.037873 
ANFIS and Fuzzy System (16 rules) [6]                   0.007 
Classical RBF (with 23 neurons) [1]                     0.0114 
Our Approach (with 12 neurons)                          0.0036 ± 0.0008 

To compare our approach with earlier works, we chose the parameters presented 
in [6]. The experiment was performed 25 times, and we show graphically one result 
that is close to the average error obtained. Fig. 4(a) shows the predicted and 
desired values (dashed and continuous lines respectively) for both training and 
test data (the prediction is indistinguishable from the time series here). As 
they are practically identical, the difference can only be seen on a finer scale 
(Fig. 4(b)). Table 1 compares the prediction accuracy of different computational 
paradigms presented in the literature for this benchmark problem (including 
our proposal), for various fuzzy system structures, neural systems and genetic 
algorithms [4, 5, 6, 7] (each reference uses a different number of decimal places 
for the prediction error; we quote exactly the value presented). 



5 Conclusion 

This article describes a new structure for an RBF neural network that 
uses regression weights to replace the constant weights normally used. These 
regression weights are assumed to be functions of the input variables. In this 
way the number of hidden units within an RBF neural network is reduced. A 
new type of nonlinear function is proposed: the pseudo-gaussian function. With 
this, the neural system gains flexibility, as the neurons possess an activation 
field that does not necessarily have to be symmetric with respect to the centre 
or to the location of the neuron in the input space. In addition to this new 
structure, we propose a sequential learning algorithm, which is able to adapt the 
structure of the network. This algorithm makes it possible to create new hidden 
units and also to detect and remove inactive units. We have presented conditions 
to increase or decrease the number of neurons, based on the novelty of the data 
and on the overall behaviour of the neural system, respectively. The feasibility 
of the evolution and learning capability of the resulting algorithm for the neural 
network is demonstrated by predicting time series. 

Abstract. This paper is concerned with the development of a naive geography 
analyst system, which can provide the analysts with image exploitation 
techniques based on naive geography and commonsense spatial reasoning. In 
the system approach, naive geography information is acquired and represented 
jointly with imagery to form cognitively oriented interactive 3-D visualization 
and analysis space, and formal representations are generated by inferring a set 
of distributed graphical depictions representing naive (commonsense) 
geographical space knowledge. The graphical representation of naive 
geography information is functional in the sense that analysts can interact with 
it in ways that are analogous to corresponding interactions with real-world 
entities and settings in the spatial environments. 



1 Introduction 

The extraction of regions of interest from imagery relies heavily on the tedious work 
of human analysts who are trained individuals with an in-depth knowledge and 
expertise in combining various observations and clues for the purpose of spatial data 
collection and understanding. In particular, since the recent explosion of 
available imagery data overwhelms imagery analysts and outpaces their ability to 
analyze it, analysts face the difficult tasks of evaluating diverse types of 
imagery, producing thoroughly analyzed and contextually based products, and at the 
same time meeting demanding deadline requirements. Consequently, the exploitation 
task is becoming the bottleneck for the imagery community. This situation 
generates an urgent need for new techniques and tools that can assist analysts in 
transforming this huge amount of data into useful, operational, and tactical 
knowledge. To address the exploitation bottleneck, we need new tools that 
encompass a broad range of functional capabilities and utilize state-of-the-art 
exploitation techniques. These new techniques should greatly improve the analysts' 
ability to access and integrate information. Examples of such techniques include: 

1. Superimposition techniques in which symbolic information overlaid on 
displayed imagery could provide additional exploitation clues for the analyst. 

2. Automated exploitation aids which are automated target recognition (ATR) 
systems with a human in the exploitation loop. 

3. Techniques incorporating cognitive aspects of exploitation processes. 

4. Task management and interfacing techniques which incorporate mechanisms for 
efficiently dividing exploitation process tasks between people and computational 
systems. 

This paper is concerned with the development of the NG-Analyst system, which can 
provide the analysts with image exploitation techniques based on Naive Geography 
and commonsense spatial reasoning. Various publications related to NG (Naive 
Geography) reveal a variety of explorative research activities dedicated mainly to the 
formalization of commonsense geographical reasoning. 

Egenhofer and Mark [3] presented an initial definition of Naive Geography, seen 
as the body of knowledge that people have about the surrounding geographic world. 
Primary theory [10] covers knowledge for which commonsense notions and scientific 
theories correspond. Formal models of commonsense knowledge have been examined by 
philosophers [14,15], and commonsense physics, or naive physics, has been an 
important topic in artificial intelligence for some time [5-7]. 

Smith claimed that the bulk of our common-sense beliefs is associated with a 
corresponding region of common-sense objects [13]. He suggested that it is erroneous 
to study beliefs, concepts and representations alone, as is done in cognitive science, 
and that it is necessary to study the objects and the object-domains to which the 
beliefs, concepts and representations relate. 

Research in the area of spatial relations provides an example in which the 
combination and interplay of different methods generate useful results. The treatment 
of spatial relations within Naive Geography must consider two complementary 
sources: (1) the cognitive and linguistic approach, investigating the terminology 
people use for spatial concepts [9,16] and human spatial behavior, judgments, and 
learning in general; and (2) the formal approach concentrating on mathematically 
based models, which can be implemented on a computer [4,8,12]. The formalisms 
serve as hypotheses that may be evaluated with human-subject testing [11]. 

There has been considerable interest in the application of intelligent system 
techniques for the construction of Geographical Information Systems, spatial analysis 
and spatial decision support. Existing applications of Intelligent Systems techniques 
within GIS and related areas generally fall into one of three distinct classes: 

1. Data access and query - Fuzzy Logic has been widely used to handle imprecision 
and reason about imprecise spatial concepts (e.g. near, high, etc) in text-based 
queries. Fuzzy spatial query is one area that has attracted a great deal of interest 
[17,18]. 

2. Spatial analysis and modeling - This is the main GIS-related application area 
of intelligent techniques such as Neural Networks, Genetic Algorithms, Rule-based 
Systems, and Intelligent front ends. 

3. Expert Systems - Expert systems have been extensively assessed and used for 
decision support and knowledge encapsulation [2]. Fuzzy logic has also attracted 
a great deal of interest in recent years with, for example, applications in fuzzy 
spatial relations [1] and land use classification. Expert system shells are 
potentially useful as decision aids in all aspects of GIS, from the collection 
(sampling) and calibration of data to the analysis of these data and the 
construction of queries. These have been termed Intelligent GIS [2]. 



2 Naive Geography Analyst 

The NG-Analyst includes naive geography formal representation structures suitable 
for imagery data analysis, human-oriented graphical depictions of naive geography 
structures, and an environment for visual integration of naive geography depictions 
and imagery data. NG-Analyst consists of the following five functionalities (see 
Figure 1): 

1. Imagery Display: Display of imagery in a 3D visualization graphical interface 
with user-controlled zooming, rotation, and translation capabilities, 

2. NG Entity Extraction: Analysis of imagery data for extraction of geographical 
features that can be mapped to NG entities as defined by the NG ontology, 

3. NG Graphical Object Display: Rendering of graphical objects superimposed on 
imagery data and representing extracted NG entities, 

4. NG Analysis: Processing of NG entities for their compliance with a naive 
geography-based spatial reasoning knowledge base, and 

5. Final Display: Additional graphical objects are rendered and/or object graphical 
properties are changed in order to enhance users’ cognitive capabilities for imagery 
analysis. 

Fig. 1. Processing Steps of NG-Analyst System 



2.1 Imagery Display 

The interactive visual representation of imagery and all graphical objects associated 
with it is referred to as the Imagery NG landscape. 

The component implements the Model-View-Controller paradigm of separating 
imagery data (model) from its visual presentation (view). Interface elements 
(controllers) act upon models, changing their values and effectively changing the 
views. 

Such a paradigm supports the creation of applications which can attach multiple, 
simultaneous views and controllers onto the same underlying model. Thus, a single 
landscape (imagery and objects) can be represented in several different ways and 
modified by different parts of an application. The controller can achieve this 
transformation with a broad variety of actions, including filtering and 
multi-resolution, zooming, translation, and rotation. The component provides 
navigational aids that enhance the user’s explorative capabilities (e.g., a view 
from above provides a good overview of the information, and zooming in to inspect 
small items allows the user to get a detailed understanding). 
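
The separation described above can be illustrated with a minimal Model-View-Controller sketch. It is written in Python for brevity and is not the system's code; the class names, attributes and the text-only view are hypothetical.

```python
class LandscapeModel:
    """Model: the imagery and its view parameters; notifies attached views."""

    def __init__(self, image_name):
        self.image_name = image_name
        self.zoom = 1.0
        self.rotation_deg = 0.0
        self._views = []

    def attach(self, view):
        self._views.append(view)

    def update(self, **changes):
        for name, value in changes.items():
            setattr(self, name, value)
        for view in self._views:      # every attached view is refreshed
            view.refresh(self)


class TextView:
    """View: one presentation of the model (here, a text summary)."""

    def __init__(self):
        self.last_render = None

    def refresh(self, model):
        self.last_render = (f"{model.image_name} zoom={model.zoom} "
                            f"rot={model.rotation_deg}")


class ZoomController:
    """Controller: turns a user action into a change of the model."""

    def __init__(self, model):
        self.model = model

    def zoom_in(self, factor=2.0):
        self.model.update(zoom=self.model.zoom * factor)
```

Attaching two views to the same model and calling `zoom_in` once updates both, which is the property that lets a single landscape be shown in several ways at the same time.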



2.2 NG Entity Extraction 

The NG ontology set is defined as a taxonomy of NG entity classes that form a 
hierarchy of kinds of geographic “things”. An NG entity is subject to commonsense 
spatial reasoning processes. Examples of these processes include such constructs as: 
“Towns are spaced apart”, “Small towns are located between big towns”, and “Gas 
stations are located on highways”. The extraction of NG entities can be accomplished 
through one of the following: 

• Manual annotation by users (through a graphical interface), 

• Annotated maps and/or GIS systems, and 

• Automated detection and annotation (e.g., using classifier systems). 

In this work, the first approach has been considered; the others will be 
considered in future work. The NG entity representation used in this work is a 
vector representation consisting of (1) geographical feature parameters, which 
describe various characteristics of geographical features, and (2) graphical 
parameters, which describe graphical properties of the 3D objects corresponding 
to NG entities. 
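
One way to picture such a two-part vector representation is the small sketch below. The class, field names and example parameters are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class NGEntity:
    """An NG entity with its two groups of parameters."""
    entity_class: str
    features: dict = field(default_factory=dict)   # geographical feature parameters
    graphics: dict = field(default_factory=dict)   # graphical parameters

    def as_vector(self, feature_keys, graphic_keys):
        """Flatten into one vector: feature parameters first, then graphical ones."""
        return ([self.features.get(k) for k in feature_keys]
                + [self.graphics.get(k) for k in graphic_keys])
```

For example, `NGEntity("Road", {"width_m": 32.0}, {"color": "red"}).as_vector(["width_m"], ["color"])` yields `[32.0, "red"]`.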
2.3 NG Graphical Object Display 

The extracted NG entities are visualized by the imagery display component as 
graphical objects. Graphical parameters in the NG vector representation are used in 
this visualization process. All graphical objects, superimposed on imagery display, 
form an interactive graphical map of NG entities. Effective use of geometric shapes 
and visual parameters can greatly enhance the user’s ability to comprehend NG entities. 
This interactive map is a starting point for further analysis of relations among entities 
and their compliance with the NG Knowledge Base. 



2.4 NG Analysis 

The NG Analysis component uses a production system to implement the NG 
Knowledge Base (NGKB). NGKB defines commonsense relations governing the set 
of NG entities. Figure 2 depicts the component architecture. NGKB is implemented as 
a database of production rules. A production rule has the form IF [condition] THEN 
[action]. A set of rules requires a rule-firing control system to use them and a 
database to store the current input conditions and the outputs. A production rule 
can have one of the following syntaxes: 

IF [condition] THEN [action] 

IF [condition1 AND condition2] THEN [action] 

IF [condition1 OR condition2] THEN [action] 

If the condition(s) on the left-hand side are met then the rule becomes applicable 
and is ready to be fired by the Control System. 

Two types of production rules are considered: (1) Constraint Satisfaction Rules and 
(2) New NG Entity Extraction Rules. The constraint satisfaction rules govern 
commonsense relations among NG entities. Examples include: “A creek cannot flow 
uphill” and “An object cannot be in two different places at the same time.” The 
following are examples of New NG Entity Extraction Rules: 

IF (Road width > 30 m) & (Number of intersections with the Road over a distance 
of 2 km > 4) & (Road is straight) & (Road is over grass/soil area) 
THEN Generate an alert object Airfield Area. 

IF (Number of Military Buildings over an area of 0.5 km² > 3) 
THEN Allocate an object Military Complex. 

IF (Distance from Building#1 to Building#2 < 30 m) & (Type of Building#1 = Type 
of Building#2) 
THEN Group objects into a new object Building Complex at a higher abstraction level. 
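
A production rule of this kind can be represented as a condition and an action over a dictionary of facts. The sketch below encodes the airfield rule as an example; the `Rule` class, the fact keys and the returned string are hypothetical and only illustrate the IF/THEN structure.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Rule:
    """A production rule: IF condition(facts) THEN action(facts)."""
    name: str
    condition: Callable[[dict], bool]
    action: Callable[[dict], str]

airfield_rule = Rule(
    name="airfield",
    condition=lambda f: (f["road_width_m"] > 30
                         and f["intersections_per_2km"] > 4
                         and f["road_is_straight"]
                         and f["surface"] in ("grass", "soil")),
    action=lambda f: "alert: Airfield Area",
)

def applicable(rules, facts):
    """Return the rules whose conditions hold for the given facts."""
    return [r for r in rules if r.condition(facts)]
```

The list returned by `applicable` is what a control system would then choose from when deciding which rule to fire.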

Fig. 2. NG Analyst Production Rules 

Sometimes more than one rule may be applicable, so the Control System must know 
how to select one rule to fire out of several. It must fire only one rule per 
cycle because any fired rule might affect the set of facts and hence alter which 
rules are then applicable. The set of applicable rules is known as the Conflict 
Set, and the process of deciding which one to use is called Conflict Resolution. 
A number of different techniques are used: 

• Highest priority: The rules are ranked such that the highest priority rules are 
checked first, so that as soon as one of them becomes applicable it is fired. This 
is a very simple technique and can be quite efficient. 

• Longest matching: A rule with many conditions to be met is 'stricter' than one with 
fewer. The longest matching strategy involves firing the 'strictest' rule of the set. 
This follows the premise that the stricter condition conveys more information when 
met. 

• Most recently used: This technique takes the most recently fired rule from the 
conflict set and has the advantage of presenting a depth-first search that follows the 
path of greatest activity in generating new knowledge in the database. 

• Most recently added: Use the newest rule in the conflict set. This technique is 
only usable in systems that create and delete rules as they go along. 
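
The four strategies can be sketched as different ranking keys applied to the conflict set. In this illustration each rule is a plain dictionary, and the field names (`priority`, `conditions`, `last_fired`, `added_at`) are hypothetical bookkeeping values that a control system would have to maintain.

```python
def resolve(conflict_set, strategy):
    """Pick the single rule to fire from the conflict set, or None if empty."""
    if not conflict_set:
        return None
    keys = {
        "highest_priority": lambda r: r["priority"],
        "longest_matching": lambda r: len(r["conditions"]),
        "most_recently_used": lambda r: r["last_fired"],
        "most_recently_added": lambda r: r["added_at"],
    }
    return max(conflict_set, key=keys[strategy])
```

Only one rule is returned per call, matching the requirement that a single rule fires per cycle before the conflict set is recomputed.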

2.5 Final Display 

During the final display step, imagery data is rendered together with the following 
objects: 

• Initially extracted graphical representations of NG entities, 

• Graphical representation of the activated Constraint Satisfaction rules, and 

• Graphical representation of the activated NG Entity Extraction rules. 

[Rule shown in the interface screenshot: IF objects “Ammunition Depot” and 
“Highway” and “100 Meters Apart”, THEN “Low Prob. Ammunition Depot”.] 

Fig. 3. NG-Analyst Visualization Interface 

The highly interactive final display supports the following operations: 

• Landscape Navigation: using mouse movement and controls, the user can change 
views by zooming on the landscape, rotating around it, and/or translating its 
display. 

• Semantic Zooming: a display of textual information can be obtained by “brushing” 
a given graphical object with the mouse pointer. 

• Generation of Multiple Views: multiple landscapes can be rendered in the 
visualization space. 

• Linking to Additional Information: moving the mouse pointer to a graphical object 
and clicking on it invokes the process of displaying additional information 
associated with this object. 



3 System Implementation 



The visualization program for the NG-Analyst system was implemented in Java and 
compiled using the JDK 1.4.1 Java environment and Java 3D class libraries¹. The 
program displays an image and NG graphical objects. The user can select the image 
names using the pull-down menu function Load Image/Object. The file and folder 
names for each image are stored in a table of a database. The database also 
contains information on NG entity attributes (object type, probability, size, 
shape, etc.) as well as properties of graphical depictions corresponding to the NG 
entities (e.g., color, size, textual description, and position on the image 
landscape). The database has been built using Microsoft Access. A small production 
rule system was also implemented in this prototype version in order to provide 
analytical capabilities. The user can analyze the displayed NG graphical objects 
for their compliance with the rule set. Figure 3 shows the visualization program 
interface. 

¹ © In3D Java edition by Visual Insights. 



4 Experimental Evaluation 

The imagery database used during an experimental evaluation contained an aerial 
imagery set from the Kosovo operation Allied Force. In evaluating the initial NG- 
Analyst system, the following aspects were considered: 

• Simplicity of use, 

• Navigational locomotion, 

• Compatibility and standards, 

• Completeness and extensibility, and 

• Maintenance. 

Three evaluation approaches have been used: formal (by means of technical 
analysis), empirical (by means of experiments involving the users) and heuristic 
(judgments and opinions stated after interacting with the system, e.g., formed 
by looking at the NG-Analyst interface and assessing what is complete and what is 
deficient about it). 



5 Conclusion 

In this paper, we have presented the NG-Analyst system, which includes naive geography
formal representation structures suitable for imagery data analysis, human-oriented
graphical depictions of naive geography structures, and an environment for visual
integration of naive geography depictions and imagery data. NG-Analyst has been
developed to enhance analysts’ performance and to help train new analysts by
capturing various cognitive processes involved in human spatial data interpretation
tasks. As future work, we will develop an automated detection/annotation approach
for the extraction of NG entities, to replace manual annotation by users through a
graphical interface.



Abstract. This paper makes two contributions towards increasing the engagement
of users in virtual heritage environments by adding virtual living
creatures. The work is carried out in the context of models of the Mayan
cities of Palenque and Calakmul. Firstly, it proposes a virtual guide who
navigates a virtual world and tells stories about the locations within it,
bringing to them her personality and role. Secondly, it develops an architecture
for adding autonomous animals to virtual heritage. It develops
an affective component for such animal agents in order to increase the realism
of their flocking behaviour and adds a mechanism for transmitting
emotion between animals via virtual pheromones, modelled as particles
in a free expansion gas.



1 Introduction 

Nowadays, virtual environments are becoming a widely-used technology as the 
price of the hardware necessary to run them decreases. Many recently developed 
virtual environments recreate real spaces with an impressive degree of realism. In 
such contexts, however, users frequently perceive a lack of information,
which makes them lose interest in these environments. In the real world, people
relate the environments that surround them to the stories they know about the 
places and objects in the environment. Therefore, in order to obtain more human 
and useful virtual environments, we need to add a narrative layer to them. We 
need stories related to the places and objects in the world. And finally, we need 
a virtual guide able to tell us these stories. Furthermore, one of the most striking 
features of historical investigations is the coexistence of multiple interpretations 
of the same event. The same historical events can be told as different stories 
depending on the storyteller’s point of view. It would be interesting if the virtual
guide who tells us stories about the virtual environment she¹ inhabits could
tell us these stories from her own perspective. In this sense, the first part of this
paper describes the design and development of a novel proposal for storytelling
in virtual environments from a virtual guide perspective.

¹ In order to avoid confusion, in this paper the virtual guide is taken to be female,
while the human guide is taken to be male.

On the other hand, in order to obtain more believable virtual environments,
some research groups are trying to simulate virtual animals. In nature, animals
usually behave as part of social groups. Therefore, if we want to populate virtual
environments with believable virtual animals, we need to emulate the behaviour
of animal groups. Furthermore, our assumption of life and smartness in real
animals derives from our perception of their reaction to the environment. For
example, if we run through a herd of deer, we expect them to move in order
to avoid us. This reaction seems to be driven by an emotional stimulus (most likely
fear), and is communicated amongst conspecifics at an emotional level. Thus,
believable virtual animals should display this kind of emotional response and
communication. The second part of this paper proposes an architecture to model
and simulate animals that not only “feel” emotions that affect their decision
making, but are also able to communicate them through virtual pheromones.
The virtual animals also show group behaviour, in particular a flocking
behaviour based on the boids algorithm [11], modified so that it takes into account
the animals’ emotions. In this sense, the emotional communication “drives” the
emergence of flocking.

This paper describes the design and development of a novel proposal for
storytelling in virtual environments from a virtual guide perspective and how it
can be used together with animal group behaviour to add relevance to the virtual
heritage installations of Palenque and Calakmul. The structure of the paper is as
follows. First we describe the proposed system for storytelling. Then we detail
the architecture for the virtual deer. Next we present the implementation and
preliminary results. Finally we give the conclusions and point out future work.



2 Narrative Construction 

In our model the guide begins at a particular location and starts to navigate the 
world telling the user stories related to the places she visits. Our guide tries to 
emulate a real guide’s behaviour in such a situation. In particular, she behaves as 
a spontaneous real guide who knows stories about the places but has not prepared 
an exhaustive tour nor a storyline. Furthermore, our guide tells stories from her 
own perspective, that is, she narrates historical facts taking into account her own 
interests and roles. In fact, she extends the stories she tells with comments that 
show her own point of view. This mixture of neutral information and personal 
comments is what we can expect from a real guide who, on the one hand, has to 
tell the information he has learnt, but on the other hand, cannot hide his feelings, 
opinions, etc. We have designed a hybrid algorithm that models a virtual guide’s
behaviour taking into account all the aspects described above. The mechanisms
involved in the algorithm can be separated into three global processes, which are
carried out at every step. The next subsections detail these processes.

2.1 Finding a Spot in the Guide’s Memory

Given a particular step in the navigation-storytelling process (that is, the virtual
guide is at a particular location and she has previously narrated a series of
story pieces), the guide should decide where to go and what to tell there. To
emulate a real guide’s behaviour, the virtual guide evaluates every candidate pair
(storyelement, location) taking into account three different factors: the distance
from the current location to location, the story elements already told at the
current moment, and the affinity between storyelement and the guide’s profile.

A real guide will usually prefer nearer locations, as further away locations 
involve long displacements which lead to unnatural and boring delays among 
the narrated story elements. In this sense, our guide prefers nearer locations 
too. When a real guide is telling stories in an improvisational way, the already 
narrated story elements make him recall, by association, related story elements. 
In a spontaneous way, a real guide tends to tell these recently remembered 
stories. In this sense, our guide prefers story elements related (metaphorically 
remembered) to the ones previously narrated. Finally, a real guide tends to tell 
stories related to his own interests or roles. In this sense, our guide prefers story 
elements related to her own profile. 

The system evaluates every candidate pair (storyelement, location) such that 
there is an entry in the knowledge base that relates storyelement to location 
(note that this means that storyelement can be narrated in location) and such 
that storyelement has not been narrated yet. In particular three scores corre- 
sponding to the previously commented factors are calculated. These three scores 
are then combined to calculate an overall score for every candidate pair. Finally 
the system chooses the pair with the highest overall score value. 
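
The selection step described above can be sketched as follows. This is only an illustration: the paper does not state how the three scores are computed or combined, so the normalised proximity, the set-overlap measures, the weighted sum and all names (Candidate, overall_score, choose_next, weights) are assumptions made for this example.

```python
from dataclasses import dataclass
from math import dist


@dataclass(frozen=True)
class Candidate:
    story_element: str
    location: tuple[float, float]   # position of the location in the world
    related_to: frozenset[str]      # story elements this one is associated with
    topics: frozenset[str]          # topics used to match the guide's profile


def overall_score(c, current_pos, told, profile, max_dist, weights=(1.0, 1.0, 1.0)):
    # Nearer locations score higher (1 at distance 0, 0 at max_dist or beyond).
    proximity = 1.0 - min(dist(current_pos, c.location) / max_dist, 1.0)
    # Fraction of already-told elements that recall this one by association.
    association = len(c.related_to & told) / len(told) if told else 0.0
    # Fraction of the element's topics that match the guide's profile.
    affinity = len(c.topics & profile) / len(c.topics) if c.topics else 0.0
    w_d, w_a, w_p = weights
    # Assumed combination rule: a weighted sum of the three scores.
    return w_d * proximity + w_a * association + w_p * affinity


def choose_next(candidates, current_pos, told, profile, max_dist):
    # Only story elements that have not been narrated yet are eligible.
    eligible = [c for c in candidates if c.story_element not in told]
    if not eligible:
        return None
    return max(eligible,
               key=lambda c: overall_score(c, current_pos, told, profile, max_dist))
```

With equal weights, a nearby element that is associated with what has been told and matches the guide's profile wins over a distant, unrelated one; changing the weights shifts the guide between a "stay close" and a "follow my interests" style.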

2.2 Extending and Contextualising the Information 

Figure 1a represents a part of the general memory the guide uses. This memory
contains story elements that are interconnected with one another in terms of
cause-effect and subject-object relations. Figure 1b shows the same part of the
memory, where a story element has been selected by obtaining the best overall
score described in the previous section. If the granularity provided by the selected
story element is not considered to be large enough to generate a little story,
then more story elements are selected. The additional story elements are chosen
according to particular criteria (cause-effect and subject-object in our case).
This process can be considered as navigating the memory from the original story
element. Figure 1c shows the same part of the memory, where three additional
story elements have been selected by navigating from the original story element.

The selected story elements are translated, if possible, from the virtual guide’s
perspective (see figure 1d). For this task the system takes into account the guide’s
profile and meta-rules stored in the knowledge base that are intended to situate
the guide’s perspective. The translation process also generates guide attitudes
that reflect the emotional impact that these story elements have on her. Let us
look at a simple example. Assume the following information, extracted from a
selected story element:

fact(colonization, Spanish, Mayan)

meaning that the Spanish people colonized the Mayan people. And let us assume the
following meta-rule included in the knowledge base, aimed at situating the guide’s
perspective:

fact(colonization, Colonizer, Colonized) and profile(Colonized) =>
    fact(colonizedColonization, Colonizer, Colonized) and
    guideattitude(anger)

meaning that a colonizedColonization fact and anger as the guide’s attitude 
should be inferred if a colonization fact is included in the story element and 
the guide profile matches the third argument of this fact, that is, the guide is 
the Colonized. In this example that will happen if the guide is Mayan. The new 
inferred fact represents the original one but from the guide’s perspective. 

In addition, the new translated story elements are enhanced by means of new
information items generated by applying simple commonsense rules, which allow the
guide to add comments showing her perspective. The guide uses the new contextualised
story elements (figure 1d) as input for the rules that codify commonsense
(figure 1e). By applying these rules the guide obtains consequences that are
added to the contextualised story elements (figure 1f), obtaining a new data
structure which codifies the information that should be told. Let us continue
with the previous example. Assume the following commonsense rule:

fact(colonizedColonization, Colonizer, Colonized) =>
    fact(culturalDestruction, Colonized) and
    fact(religionChange, Colonized)

meaning that the colonized’s view implies the destruction of the colonized’s 
culture and the change of the colonized’s religion. Therefore, if in our example 
the guide were Mayan, the story element to be told would be enhanced with the 
facts culturalDestruction and religionChange. 
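
The two inference steps of this example can be traced with a small hand-coded sketch. It is not the rule engine used by the system (Section 4 reports that inference is carried out with Jess); the tuple encoding of facts and the function names contextualise and apply_commonsense are assumptions made only to show the data flow from a neutral fact to the enhanced, guide-specific facts.

```python
def contextualise(facts, profile):
    """Meta-rule: restate colonization facts from the guide's perspective."""
    new_facts, attitudes = [], []
    for fact in facts:
        # The rule fires only if the guide's profile matches the Colonized argument.
        if fact[0] == "colonization" and fact[2] == profile:
            _, colonizer, colonized = fact
            new_facts.append(("colonizedColonization", colonizer, colonized))
            attitudes.append("anger")
        else:
            new_facts.append(fact)
    return new_facts, attitudes


def apply_commonsense(facts):
    """Commonsense rule: add the consequences of a colonizedColonization fact."""
    extended = list(facts)
    for fact in facts:
        if fact[0] == "colonizedColonization":
            colonized = fact[2]
            extended.append(("culturalDestruction", colonized))
            extended.append(("religionChange", colonized))
    return extended


facts = [("colonization", "Spanish", "Mayan")]
contextual, attitudes = contextualise(facts, profile="Mayan")
to_tell = apply_commonsense(contextual)
```

For a Mayan guide, to_tell holds the translated fact plus the two inferred consequences and attitudes holds anger; for a guide with any other profile the original fact passes through unchanged and no attitude is generated.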




Fig. 1. Storyboard construction

2.3 Generating the Story

As a result of the previous processes, the guide obtains a set of inter-related
information items to tell (figure 1f). Some elements are also related to particular
guide attitudes. Now the system generates the text to tell (expressing these
elements) as well as the special effects and guide actions to show while telling the
story. The phases of this story generation process are as follows:

1. The first step is to order the data elements. To do so we consider three
criteria: cause-effect (if an element Y was caused by another element X,
then X should precede Y), subject-object (the elements whose subject/object
are similar should be grouped together) and classic climax (the first selected
story element, i.e. the one that obtained the best overall score, is supposed
to be the climax of the narration, and therefore all the other elements
are arranged around it).

2. The text corresponding to the ordered set of elements is generated. The
complexity of this process depends on the particular generation mechanism
(we use a template system) and the degree of granularity employed (we use
one sentence per story element).

3. A process that relies on the guide expression rules (the set of rules that
translate the guide’s abstract attitudes into particular guide actions) generates
a set of guide actions (each one related to a particular story element).

4. Every story element is associated with particular environment conditions or
special effects. Thus, finally, a storyboard like the one shown in figure 1g is
obtained.
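
Phases 1 and 2 can be illustrated with a short sketch. It covers only the cause-effect criterion (via a topological sort) and a one-sentence-per-element template step; the subject-object grouping and the climax criterion are left out, and the template wording, the tuple encoding of elements and the function names are assumptions for this example rather than the system's actual templates.

```python
from graphlib import TopologicalSorter


def order_elements(elements, causes):
    """Order elements so that every cause precedes its effects.

    `causes` maps an element to the collection of elements that caused it.
    """
    known = set(elements)
    graph = {e: set(causes.get(e, ())) & known for e in elements}
    return list(TopologicalSorter(graph).static_order())


# Hypothetical templates: one sentence per story element.
TEMPLATES = {
    "colonizedColonization": "The {0} colonized us, the {1}.",
    "culturalDestruction": "The culture of the {0} was destroyed.",
    "religionChange": "The {0} were forced to change their religion.",
}


def generate_text(ordered, templates):
    # Each element is a tuple (kind, arg1, arg2, ...); args fill the template slots.
    return [templates[kind].format(*args) for kind, *args in ordered]


colonization = ("colonizedColonization", "Spanish", "Mayan")
destruction = ("culturalDestruction", "Mayan")
religion = ("religionChange", "Mayan")
causes = {destruction: [colonization], religion: [colonization]}

ordered = order_elements([religion, destruction, colonization], causes)
sentences = generate_text(ordered, TEMPLATES)
```

Whatever order the elements arrive in, the colonization fact is narrated first because both other elements are its effects; the relative order of the two effects is left to the sorter.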

3 Communicating Emotions and Group Behaviour 

3.1 Overall Architecture 

The basic task of an animal brain has often been split into three sub-tasks. 
Our model adds a fourth sub-task, emotions. The four sub-tasks in our system 
are therefore: perception, emotions, action selection and motor control. Figure 
2 shows a detailed diagram of the designed architecture, and the next sections 
describe its components. 

3.2 Communicating Emotions 

In the real-world, emotional transmission may well be multi-modal, with certain 
modes such as the perception of motion being particularly difficult to model. 
Thus we have limited ourselves for now to a single mode, and the one we have 
chosen is pheromones, to be perceived by a virtual olfaction sensor. 

The nose has been linked with emotional responses and intelligence. Recent
experiments [5] have shown that mammals emit pheromones through apocrine
glands as an emotional response, and as a means to communicate that state to
conspecifics, who can adapt their behaviour accordingly. Research has also found that
odours produce a range of emotional responses in animals [6], which is adaptively
advantageous because olfaction is part of the old smell-brain, which can generate
fast emotional responses. Neary [10] points out that sheep will usually move
more readily into the wind than with the wind, allowing them to utilise their
sense of smell. In real animals, chemoreceptors (exteroceptors and interoceptors)
are used to identify chemical substances and detect their concentration. In our
architecture we intend to model the exteroceptors, which detect the presence of
chemicals in the external environment.

Fig. 2. Detailed architecture

In this work, to illustrate the use of emotions and drives to influence
behaviours, deer have been selected as the exemplar creature. To support the
communication of emotions, an environmental simulator has been developed. Its
tasks include changing the temperature and other environmental variables
depending on the time of day and on the season, based on statistical
historical data. An alarmed animal sends virtual pheromones to the environmental
simulator, where they are simulated using the free expansion gas formula,
in which the volume depends on the temperature and altitude (both simulated
environmental variables). To compute the distribution of the pheromones, a set
of particles is simulated using the Boltzmann distribution, formula (1), which
is shown and described next.



\[ n(y) = n_0 \, e^{-\frac{m g y}{k_b T}} \tag{1} \]

where m is the pheromone’s mass; g is the acceleration due to gravity; y is the
altitude; k_b is the Boltzmann constant; T is the temperature; n_0 is N/V; N is the
number of molecules exuded by the apocrine gland, which is related to the intensity
of the emotional signal; and V is the volume. The virtual animal includes a virtual
nose used to detect pheromones, if any, that are near the creature. The threshold for
smelling a pheromone in the current experiment is set to 200 × 10^-16, because [8] has
shown that animals are 1000 to 10000 times more sensitive than humans and Wyatt [15]
claims that the threshold in humans for detecting certain “emotional response odours”,
like those exuded from the armpits, is 200 parts per trillion, that is, 200 × 10^-12.
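
A minimal numerical sketch of formula (1) and of the threshold test follows. The values used for the pheromone mass, the number of molecules and the volume are placeholders, the function names are invented for the example, and the default threshold assumes the human figure of 200 parts per trillion divided by the upper sensitivity factor of 10000. How the simulator converts a number density into the relative concentration compared against the threshold is not described in the text, so can_smell simply takes that concentration as its input.

```python
from math import exp

K_B = 1.380649e-23   # Boltzmann constant, J/K
G = 9.81             # acceleration due to gravity, m/s^2


def pheromone_density(y, n_molecules, volume, mass, temperature):
    """Number density n(y) = n0 * exp(-m*g*y / (k_b*T)), with n0 = N/V."""
    n0 = n_molecules / volume
    return n0 * exp(-mass * G * y / (K_B * temperature))


def can_smell(concentration, threshold=200e-16):
    """True if a relative (dimensionless) concentration reaches the nose threshold."""
    return concentration >= threshold


# Placeholder values: 1e6 molecules released into 2 m^3 at 300 K.
ground = pheromone_density(0.0, 1e6, 2.0, mass=4e-25, temperature=300.0)
aloft = pheromone_density(100.0, 1e6, 2.0, mass=4e-25, temperature=300.0)
```

The density is highest at ground level and decays exponentially with altitude; a higher temperature flattens the decay, which is how the simulated environmental variables influence what a nearby animal can smell.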



3.3 Action Selection 

The problem of action selection is that of choosing at each moment the most 
appropriate action out of a repertoire of possible actions. The process of making 
this decision takes into account many stimuli, including (in our case) the ani- 
mal’s emotional state. Action selection algorithms have been proposed by both 
ethologists and computer scientists. The models suggested by ethologists
are usually at a conceptual level, while the ones proposed by computer scientists
(with some exceptions, such as [14] and [2]) generally do not take into account
classical ethological theories. According to Dawkins [3], a hierarchical structure
represents an essential organising principle of complex behaviours. This view is
shared by many ethologists [1] [13], and some action selection models follow this
approach. Our action selection mechanism is based on Tyrrell’s model [14]. This
model is a development of Rosenblatt & Payton’s original idea [12] (basically a
connectionist, hierarchical, feed-forward network), to which temporal and uncertainty
penalties were added, and for which a more specific rule for the combination
of preferences was produced. Note that, among other stimuli, our action selection
mechanism takes into account the emotional states of the virtual animal.



3.4 The Flocking Behaviour 

The flocking behaviour in our system is based on boids [11], although we have
extended it with an additional rule (escape) and, most importantly, the flocking
behaviour itself is parameterised by the output of the emotional devices, that is, by
the values of the emotions the boids feel. The escape rule is used to influence
the behaviour of each boid in such a way that it escapes from potential danger
(essentially predators) in its vicinity. Therefore, in our model each virtual animal
moves along a vector which is the resultant of four component vectors,
one for each of the behavioural rules: Cohesion (attempt to stay
close to nearby flockmates), Alignment (attempt to match velocity with nearby
flockmates), Separation (avoid collisions with nearby flockmates) and Escape
(escape from potential danger, predators for example). The resultant vector,
Velocity_A, for a virtual animal A is calculated as follows:

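\[ V_A = \underbrace{(\mathit{Cf} \cdot \mathit{Cef} \cdot \mathit{Cv})}_{\text{Cohesion}} + \underbrace{(\mathit{Af} \cdot \mathit{Aef} \cdot \mathit{Av})}_{\text{Alignment}} + \underbrace{(\mathit{Sf} \cdot \mathit{Sef} \cdot \mathit{Sv})}_{\text{Separation}} + \underbrace{(\mathit{Ef} \cdot \mathit{Eef} \cdot \mathit{Ev})}_{\text{Escape}} \tag{2} \]

\[ \mathit{Velocity}_A = \mathit{limit}(V_A, (\mathit{MVef} \cdot \mathit{MaxVelocity})) \tag{3} \]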

where Cv, Av, Sv and Ev are the component vectors corresponding to the
cohesion, alignment, separation and escape rules respectively. Cf, Af, Sf and
Ef are factors representing the importance of the component vectors Cv, Av,
Sv and Ev respectively. These factors allow each component vector to be weighted
independently. In our current implementation they can be varied, in real time,
from a user interface. Cef, Aef, Sef and Eef are factors representing the
importance of the component vectors Cv, Av, Sv and Ev respectively, given
the current emotional state of the virtual animal. That is, each of these factors is
a function that takes the current values of the animal’s emotions and generates a
weight for its related component vector. MaxVelocity is the maximum velocity
allowed to the animal. In the current implementation it can be varied from a user
interface. MVef is a factor whose value is calculated as a function of the current
values of the animal’s emotions. It allows the animal’s MaxVelocity to be increased
or decreased depending on its emotional state. limit is a function whose value
is equal to its first parameter if this is not greater than its second one; otherwise
the function value is equal to its second parameter.



Fig. 3. How fear parameterises the flocking algorithm



The emotional factors (Cef, Aef, Sef, Eef and MVef) reflect ethological
heuristic rules. Figure 3 shows an example of how emotions parameterise the
flocking behaviour. In particular, it shows how fear affects the component vectors
of the animals’ behaviour. The greater the fear an animal feels, the greater
the weight of both its cohesion vector (the animal tries to stay closer to nearby
flockmates) and its escape vector (the animal tries to stay farther from the potential
danger). The resultant vector obtained by adding the four basic vectors is then
scaled so as not to exceed the maximum speed. This maximum velocity is also
parameterised by fear: the greater the fear an animal feels, the greater the
speed it is able to reach.
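
Equations (2) and (3) can be sketched in a few lines. The sketch works in two dimensions, treats limit as rescaling a vector whose length exceeds the allowed maximum, and uses purely hypothetical emotional factors (each grows linearly with a fear value in [0, 1]); the actual heuristic functions behind Cef, Aef, Sef, Eef and MVef are not given in the text.

```python
from math import hypot


def scale(v, k):
    return (v[0] * k, v[1] * k)


def add(*vectors):
    return (sum(v[0] for v in vectors), sum(v[1] for v in vectors))


def limit(v, max_speed):
    """Return v unchanged if its length is within max_speed, else rescale it."""
    speed = hypot(*v)
    return v if speed <= max_speed else scale(v, max_speed / speed)


def emotional_factors(fear):
    # Assumed shape: fear boosts cohesion, escape and the reachable top speed.
    return {"Cef": 1.0 + fear, "Aef": 1.0, "Sef": 1.0,
            "Eef": 1.0 + fear, "MVef": 1.0 + fear}


def velocity(cv, av, sv, ev, fear, cf=1.0, af=1.0, sf=1.0, ef=1.0, max_velocity=1.0):
    e = emotional_factors(fear)
    v_a = add(scale(cv, cf * e["Cef"]), scale(av, af * e["Aef"]),
              scale(sv, sf * e["Sef"]), scale(ev, ef * e["Eef"]))   # equation (2)
    return limit(v_a, e["MVef"] * max_velocity)                     # equation (3)


calm = velocity((1.0, 0.0), (0.0, 0.0), (0.0, 0.0), (0.0, 1.0), fear=0.0)
scared = velocity((1.0, 0.0), (0.0, 0.0), (0.0, 0.0), (0.0, 1.0), fear=1.0)
```

With the same component vectors, the frightened animal ends up moving in the same direction but at twice the speed of the calm one, because fear raises both the weights of the cohesion and escape vectors and the speed limit.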

4 Implementation and Preliminary Results 

We have chosen the Unreal Tournament (UT) engine as the platform on which our
virtual guide runs. As we wished our system to be open and portable, we decided
to use Gamebots to connect our virtual guide to UT. Gamebots [7] is a modification
to UT that allows characters in the environment to be controlled via
network sockets. The core of the virtual guide is a Java application which is able
to connect to UT worlds through Gamebots. This application controls the movement
of the guide in the world as well as the presentation of special effects and
texts which show the generated narratives. The current version uses a MySQL
[9] database to store the knowledge base and Jess [4] to carry out inferences on
the information. The described system has been implemented (see figure 4a) and
works properly with small and medium-sized knowledge bases.

On the other hand, the implementation of the emotional virtual animals
architecture has three layers: the agents’ brains, the world model and the
virtual environment. As seen in figure 2, the agents’ brains are processes that
run independently on a Linux workstation. Each agent’s brain receives the
sensorial data via network sockets and sends the selected action to the world
model, which contains the agents’ bodies and the environmental simulation. The
changes to this model are reflected in the Palenque virtual environment (see
figure 4c), which was developed using OpenGL Performer. This mechanism provides
modularity and extensibility for adding or modifying the behaviour of the virtual
animals. Furthermore, the behaviour of the deer in the Calakmul environment (see
figure 4b) was prototyped in VRML and Java. The tests carried out so far, although
preliminary, show that the system is able to cope with the problem of
simulating flocks of virtual animals driven by their emotional state.




Fig. 4. User and administrator interfaces of the virtual guide (a). Snapshots of the
Calakmul (b) and Palenque (c) implemented systems



5 Conclusions and Future Work 

In this paper we have described work which aims to put “life” into virtual her- 
itage environments in two distinct ways. Firstly, we discussed work towards the 
creation of an “intelligent guide with attitude”, who tells stories in a virtual 
heritage environment from her distinct point of view. Work will continue in conjunction
with groups in Mexico who have already produced virtual models of
the Mayan cities of Palenque and Calakmul (whose location in the middle of a
jungle makes it particularly inaccessible in the real world). We believe that the
growing popularity of virtual heritage produces a growing need for intelligent
guides and that this work will therefore find many potential applications.

Secondly, this work has shown that it is feasible to bring together the simple 
rule-based architectures of flocking with the more complex architectures of au- 
tonomous agents. Initial results suggest that populating a virtual heritage site 
with virtual animals can improve the experience particularly if the animals are 
engaged in some autonomous activity. This produces more believable and more 
specific flocking behaviour in the presence of predators. Further work will be 
carried out to more accurately characterise the changes in flocking behaviour 
obtained by this extended architecture. We further plan to validate this work by 
modelling a different flocking animal, for example the musk ox, which responds 
to predators by forming a horns-out circle of adult animals with the young inside. 



Abstract. In this paper we describe a querying model that allows users
to find virtual worlds and objects in these worlds, using as a base a 
new virtual worlds representation model and a fuzzy approach to solve 
the queries. The system has been developed and checked in two differ- 
ent kinds of worlds. Both the design and current implementation of the 
system are described. 



1 Introduction 

Nowadays, virtual environments are a commonly used technology while the price 
of the hardware necessary to run them is decreasing. Current video games show 
3D environments unimaginable some years ago. Many recently developed vir- 
tual environments recreate real spaces with an impressive degree of realism. At 
the same time, some digital cities are being created on the Internet including 
interactive 3D representations of their real cities. 

Although the three-dimensional digital world offers enormous possibilities in 
the interaction between the user and the world of virtual objects, it has the same 
or more problems than the two-dimensional one. We show an example of this. If 
a friend tells us that she saw a web page with information about a new Where’s 
Wally book authored by Martin Handford, but she forgot the web page URL, 
we could connect to a web search engine, Google for example, and search for 
Martin Handford and Wally and 2003. In a few minutes we would be watching 
the book cover. However, if she tells us she visited an interesting virtual world 
where Wally himself was, how could we find it? Furthermore, even if we find the 
world, how could we find Wally in it? Certainly, if Wally decides to hide
on the Internet, he will be more successful if he does so in a virtual world. The
problem emerges when we try to access virtual worlds defined by means of sets 
of three-dimensional primitives which have nothing to do with our language. 
Therefore it is necessary that the search process and the virtual worlds share 
the same kind of representation. 

The problem, however, is not that simple. Let us suppose that our friend
has visited a huge virtual environment where there are several Wallys. She was
especially surprised by an area where there was a tiny Wally. How could we find
this location in that world? In the best case Wally’s height is codified as a number,
while our visual perception is imprecise. Therefore, the search mechanisms
should be able to deal with imprecise queries like: where is tiny Wally?

In this sense, we propose a new virtual worlds representation model that 
requires just a few additional efforts from the world creators, and adds a basic 
semantic level to the worlds which is useful to improve the interaction of the users 
with these worlds. We also describe a querying model that allows users to find 
worlds and objects in these worlds, using as a base the proposed representation 
and a fuzzy approach to solve the queries. Both proposed models taken together 
improve the current interaction with virtual worlds. 

The structure of the paper is as follows. First we review related work. Next
we present our proposal by showing the virtual worlds representation model and
detailing the fuzzy query mechanisms. Then we outline the current implementation
and preliminary results. Finally we provide the conclusions and point out
future work.



2 Related Work 

There are various works [14,16,10,11] which present virtual environments with 
a semantic information layer. Some of them add the semantic level to the vir- 
tual environments, others add the virtual environments to pre-existing semantic 
information (GIS, digital cities, etc.). 

On the other hand, several researchers have been working on what has been
called flexible querying, whose objective is to provide users with new interrogation
capabilities based on fuzzy criteria. Flexible querying has been applied to
both the database querying and the information retrieval problems. The first
fuzzy approach to database querying is probably due to Tahani [15], who defined
the concept of a fuzzy relation in a database by associating a grade of
membership with each tuple. He introduced the usual set operations (union,
disjunction, negation and difference) and the AND and OR operators. His work
was followed by numerous contributions which introduced definitions of
new operations [3,5] and new concepts, such as the so-called
quantified queries [23,7]. Furthermore, some works have been oriented towards
the deployment of such systems on the Internet [12,13]. In the area of information
retrieval several models have been proposed [4,2]. Finally, a preliminary
investigation of the potential applications of fuzzy logic in multimedia databases
is presented in [9].



3 Proposed System 

Figure 1 shows the overall architecture of the proposed system. The next sub- 
sections describe the information about the worlds that the system uses, and the 
query system itself. 

Fig. 1. Global architecture



3.1 Representing Virtual Worlds 

The current description formats of virtual worlds describe mainly geometric 
information (points, lines, surfaces, etc.), which is required by the browsers to 
visualize the worlds. Our system adds a new semantic information level to the
representation of the worlds, making them more suitable for interaction with users,
and particularly more suitable to be queried in relation to their contents. The
system uses concrete object information as well as the general information needed
to contextualise it, as described below.

Concrete Object Information. In particular, we annotate the following meta- 
contents: location, orientation, width, height, depth, spatial containment relations, 
identifier and type. Note that the first six are meta-contents which can be auto- 
matically extracted, while the last two should be manually annotated. Note also 
that additional features can be calculated from the annotated ones. For example, 
object size can be reckoned as the product of the object width, height and depth. 

Information Needed to Contextualise. This information is automatically 
obtained from the meta-contents data. In particular, maximum and minimum 
values of width, height, depth and size attributes are obtained for every type 
of object in every context considered. The contexts we consider are the set of 
all the worlds, every particular world, and some objects (spaces) of particular 
worlds. In addition, the maximum possible distance in every context is calculated.
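
The derivation of contextual information from the annotated meta-contents can be sketched as follows. The record layout, the class and function names, and the use of "*" to denote the context "all worlds" are assumptions for illustration; the sketch covers only two of the three kinds of context (all worlds and each particular world) and omits spaces within worlds and the maximum distance.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class WorldObject:
    identifier: str
    type: str
    world: str
    width: float
    height: float
    depth: float

    @property
    def size(self):
        # Size derived from the annotated dimensions: width * height * depth.
        return self.width * self.height * self.depth


def context_ranges(objects, attributes=("width", "height", "depth", "size")):
    """Return {(context, type, attribute): (min, max)} per world and for all worlds."""
    values = defaultdict(list)
    for obj in objects:
        for context in ("*", obj.world):      # "*" stands for the set of all worlds
            for attr in attributes:
                values[(context, obj.type, attr)].append(getattr(obj, attr))
    return {key: (min(v), max(v)) for key, v in values.items()}


objects = [
    WorldObject("w1", "wally", "park", 0.5, 1.8, 0.3),
    WorldObject("w2", "wally", "park", 0.2, 0.4, 0.1),
    WorldObject("w3", "wally", "city", 0.5, 1.7, 0.3),
]
ranges = context_ranges(objects)
```

Here a Wally of height 0.4 is the smallest in the park and across all worlds, while in the city the only Wally defines both ends of the range; such ranges are what later allows a word like tiny to be interpreted relative to a context.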

3.2 Querying Virtual Worlds 

We need a query system able to resolve vague user queries against
more precise knowledge bases that carry some uncertainty. To model this query
system we use an approach based on fuzzy set theory [19] and fuzzy logic. In
fact, the main contribution of fuzzy logic is a methodology for computing with
words [22].

Next we describe the main elements involved in our system. First we describe the
fuzzy dictionaries, which encode the information the fuzzy query solver needs to
work. Then we present the canonical form into which the queries are translated to
be processed by the fuzzy query solver. Finally, we show how the fuzzy query solver
functions by describing how each of its modules works.



Dictionaries. The central key of the fuzzy approach is the fuzzy set concept, 
which extends the notion of a regular set in order to express classes with ill- 
defined boundaries (corresponding to linguistic values, e.g. tall, big, etc.). Within 
this framework, there is a gradual transition between non-membership and full 
membership. A degree of membership is associated with every element x of a
referential X. It takes values in the interval [0,1] instead of the pair {0,1}. Fuzzy
modelling techniques make use of linguistic hedges as fuzzy sets transformers, 
which modify (often through the use of the basic concentrator, dilator and inten- 
sifier modifiers) the shape of a fuzzy set surface to cause a change in the related 
truth membership function. Linguistic hedges play the same role in fuzzy mod- 
elling as adverbs and adjectives do in language: they both modify qualitative 
statements. For example, very is usually interpreted as a concentrator using the
function f(x) = x^2. A linguistic variable is a variable whose values are words or
sentences in a natural or synthetic language. For example, Height is a linguistic 
variable if its values are short, not short, very short, tall, not tall, very tall, and 
so on. In general, the values of a linguistic variable can be generated from a
primary term (for example, short), its antonym (tall), a collection of modifiers (not,
very, more or less, quite, not very, etc.), and the connectives and and or. For
example, one value of Height may be not very short and not very tall. Fuzzy set 
theory also attempts to model natural language quantifiers by operators called 
fuzzy quantifiers. 
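As an illustration of how hedges transform a membership function, consider the following minimal Python sketch. The shape of `tall` (a linear ramp on a height scaled to [0, 1]) and all function names are assumptions made for the example; only the interpretation of very as f(x) = x^2 comes from the text, while the square root for more or less and the intensifier formula follow the usual conventions in the fuzzy logic literature.

```python
def ramp(x, a, b):
    """Piecewise-linear membership: 0 below a, 1 above b, linear in between."""
    return min(1.0, max(0.0, (x - a) / (b - a)))


# Assumed primary term and antonym on a height scaled to [0, 1].
def tall(x):
    return ramp(x, 0.25, 0.75)


def short(x):
    return 1.0 - tall(x)


# Hedges are transformers: they take a membership function and return a new one.
def very(mu):                       # concentrator
    return lambda x: mu(x) ** 2


def more_or_less(mu):               # dilator
    return lambda x: mu(x) ** 0.5


def intensify(mu):                  # intensifier (contrast enhancement)
    def transformed(x):
        m = mu(x)
        return 2 * m * m if m <= 0.5 else 1 - 2 * (1 - m) ** 2
    return transformed


def not_(mu):
    return lambda x: 1.0 - mu(x)


def and_(*mus):
    return lambda x: min(mu(x) for mu in mus)


# One value of the linguistic variable Height built from the pieces above.
not_very_short_and_not_very_tall = and_(not_(very(short)), not_(very(tall)))
```

With these definitions a medium height (0.5 on the scaled axis) is tall to degree 0.5, very tall to degree 0.25, and not very short and not very tall to degree 0.75.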

Our query system includes various dictionaries which define basic operations 
(conjunctions, disjunctions, negations), basic modifiers (dilation, concentration, 
intensification), hedges (very, more or less, etc.), linguistic values (tall, big, near, 
etc.), linguistic variables (width, height, depth, size, distance, etc.) and quanti- 
fiers (most, many, etc.). 



Canonical Form. As pointed out by Zadeh [21], a proposition p in a natural
language may be viewed as a collection of elastic constraints, $C_1, \ldots, C_k$, which
restrict the values of a collection of variables $X = (X_1, \ldots, X_n)$. In general the
constraints, as well as the variables they constrain, are implicit rather than
explicit in p. Viewed in this perspective, representation of the meaning of p is, in
essence, a process by which the implicit constraints and variables in p are made
explicit. In fuzzy logic, this is accomplished by representing p in the so-called
canonical form $p \rightarrow X \text{ is } A$, in which A is a fuzzy predicate or, equivalently, an
n-ary fuzzy relation in U, where $U = U_1 \times U_2 \times \cdots \times U_n$, and $U_i$, $i = 1, \ldots, n$,
is the domain of $X_i$. We show the process of making the implicit variables and
restrictions explicit through an example. Let the flexible query be I am searching 
for a park which has a tall tree. As the dictionaries relate both linguistic variables
with their possible linguistic values and object attributes with linguistic variables,
the system relates the linguistic value tall with the linguistic variable height and
this, in turn, with the world objects' attribute height. Then, the system makes
X = Height (tree) and A = TALL explicit. Therefore the explicit restriction is 
Height(tree) is TALL. Note that if the query is one that presents ambiguity (the 
linguistic value is related with various linguistic variables), it should be solved 
by interacting with the user. 

We have defined a format to describe the queries, which is a kind of canonical
form in the sense that it makes the implicit variables and restrictions explicit.
The queries are translated to this format independently of the interface modality
(natural language interface, graphical form interface). We show the format
through an example. Let the flexible query be I am searching for a world which
has a park which has a tall tree which has many nests; its representation is
[action: searching for, [object: world, quantity: 1, restrictions: [has: [object: park,
quantity: 1, restrictions: [has: [object: tree, quantity: 1, restrictions: [height: tall,
has: [object: nest, quantity: many]]]]]]]].



Fuzzy Query Solver. In this section we explain how the fuzzy query solver operates,
first describing how each of its four main modules works and then presenting
the overall operation of the complete system.

Atomic Subqueries. Let the flexible query be I am searching for a world which
has a tall tree. To calculate the degree to which a particular world fulfils the
query, we first have to evaluate the satisfaction of X is tall, where X =
Height(tree), i.e. the height attribute of a tree object of the world being
considered. This satisfaction degree is calculated in a two-step process. First
the numeric height X of the tree object is contextualized (scaled in this
case), obtaining $X_{context}$. Then $\mu_{tall}(X_{context})$ is calculated, where $\mu_{tall}$ is
the membership function corresponding to the fuzzy set associated with the
fuzzy term tall.
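The two-step evaluation of an atomic subquery can be sketched as below. This is a hedged illustration: min-max scaling is assumed as the contextualization, the shape of `mu_tall` is invented for the example, and the value returned for a degenerate context (minimum equal to maximum) is an arbitrary choice.

```python
def contextualize(value, lo, hi):
    """Scale a raw attribute value to [0, 1] using the context extremes."""
    if hi == lo:
        return 0.5  # arbitrary choice for a context with a single value
    return min(1.0, max(0.0, (value - lo) / (hi - lo)))


def mu_tall(x_context):
    """Assumed generic definition of 'tall' on the scaled axis."""
    return min(1.0, max(0.0, (x_context - 0.25) / 0.5))


def atomic_degree(height, lo, hi):
    """Degree to which 'Height(tree) is tall' holds in the given context."""
    return mu_tall(contextualize(height, lo, hi))


# An 8 m tree in a park whose trees range from 4 m to 12 m.
degree = atomic_degree(8.0, 4.0, 12.0)
```

The same tree evaluated in a context whose tallest tree is 8 m would obtain degree 1, which is the effect the contextualization is meant to capture.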

Aggregation. Let the flexible query be I am searching for a world which has 
a tall tree and a big garden. To calculate the degree to which a particular 
world fulfils the query, first we calculate the degrees to which X and Y 
fulfil the atomic subqueries X is tall and Y is big respectively, where X = 
Height (tree) and Y = Size (garden), and tree and garden are objects of the 
world being considered. Then we can use the conjunction (AND) to calculate 
the aggregation of both degrees. 

Note that the classic AND and OR connectives allow only crisp aggregations
which do not capture any vagueness. For example, the AND used for
aggregating n selection criteria does not tolerate the failure of a
single condition; this may cause the rejection of useful items. For example,
let the query be I am searching for a room which has a bed and a wardrobe
and a bedside table. It seems obvious that the user is searching for a
bedroom; however, a bedroom with no bedside table will be rejected if we use the
conjunction to aggregate the degrees of the atomic fulfilments. We supplement
the conjunction and disjunction connectives with a family of aggregation
criteria with an intermediate behaviour between the two extreme cases
corresponding to the AND and to the OR. These aggregations are modelled by
mean operators and fuzzy linguistic quantifiers (see below).
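The bedroom example can be made concrete with a few lines of Python. This sketch assumes the minimum for AND, the maximum for OR, and the arithmetic mean as one possible intermediate operator; the degrees assigned to the room are hypothetical.

```python
def fuzzy_and(degrees):
    """Conjunction: a single unsatisfied criterion rejects the item."""
    return min(degrees)


def fuzzy_or(degrees):
    """Disjunction: a single satisfied criterion is enough."""
    return max(degrees)


def fuzzy_mean(degrees):
    """Arithmetic mean: one intermediate behaviour between AND and OR."""
    return sum(degrees) / len(degrees)


# A bedroom with a bed and a wardrobe but no bedside table.
room = {"bed": 1.0, "wardrobe": 1.0, "bedside table": 0.0}
degrees = list(room.values())

as_and = fuzzy_and(degrees)    # 0.0: the room is rejected
as_or = fuzzy_or(degrees)      # 1.0: any room with a bed would do
as_mean = fuzzy_mean(degrees)  # about 0.67: the room remains a candidate
```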

Context Dependency. A fuzzy term, such as tall, may have
several meanings among which one must be chosen dynamically according
to a given context [24]. As shown above, maximum and minimum values
of width, height, depth and size attributes are obtained for every type of 
object in every context considered. Then, these values are used to get the 
contextualized meaning of fuzzy terms. For example, if I search for a tall 
tree, being myself in a virtual park in a particular world, then the meaning 
of tall is obtained contextualizing its generic definition, in this particular 
case scaling with respect to the maximum and minimum values of the tree 
heights in this park. 

We consider three main factors to contextualize the fuzzy terms in queries:
world immersion (is the user in a world?), location in a world (what is the
location of the user in the world she is inhabiting? what is the minimum
object (space) that contains the user?) and query context (what is the
context of the fuzzy term being considered in the query where it appears?).
Then, a simple algorithm decides the context of every fuzzy term in a query 
taking into account these factors. 

Linguistic Quantifiers. Let the flexible query be I am searching for a world 
which has many tall trees. To calculate the degree by which a particular 
world fulfils the whole query, first we have to calculate the contextualized 
quantification of the degrees to which every tree object in this world fulfils 
the atomic subqueries X is tall where X = Height(tree). Then the degree 
by which this quantification Q fulfils Q is many is calculated. Various inter- 
pretations for quantified statements have been proposed in literature. The 
classical approach is due to Zadeh. The most currently accepted derives from 
Yager. In [8], Bose and Pivert compare these methods to evaluate quantified 
statements. 

Zadeh [20] proposed viewing a fuzzy quantifier as a fuzzy characterization of 
an absolute or relative cardinality. The advantage of Zadeh’s approach is its 
simplicity. However, it does not permit differentiating the case where many 
elements have a small membership degree and the case where few elements 
have a high membership degree. In [17] Yager introduced the concept of
an ordered weighted averaging (OWA) operator. This operator provides a
family of aggregation operators which have the conjunction at one extreme
and the disjunction at the other extreme. Yager showed the close relationship
between the OWA operators and the linguistic quantifiers. In particular, he
suggested a methodology for associating each regular monotonically increasing
quantifier with an OWA operator. Later it was shown [6,18] that it is
possible to extend this method in order to represent monotonically decreasing
and increasing-decreasing quantifiers. Our system employs Yager’s approach.
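A minimal sketch of quantifier-guided OWA aggregation is given below. It uses the standard construction for regular increasing quantifiers, w_i = Q(i/n) - Q((i-1)/n), applied to the degrees sorted in decreasing order. The piecewise-linear definition of `many` is an assumption made for the example and is not taken from the system's dictionaries.

```python
def owa_weights(quantifier, n):
    """Weights w_i = Q(i/n) - Q((i-1)/n) for a regular increasing quantifier Q."""
    return [quantifier(i / n) - quantifier((i - 1) / n) for i in range(1, n + 1)]


def owa(degrees, quantifier):
    """Ordered weighted average: weights apply to degrees sorted in decreasing order."""
    ordered = sorted(degrees, reverse=True)
    weights = owa_weights(quantifier, len(ordered))
    return sum(w * d for w, d in zip(weights, ordered))


def for_all(r):      # yields the conjunction (minimum)
    return 1.0 if r >= 1.0 else 0.0


def exists(r):       # yields the disjunction (maximum)
    return 1.0 if r > 0.0 else 0.0


def many(r):         # assumed shape: 0 up to 25%, 1 from 75%, linear in between
    return min(1.0, max(0.0, (r - 0.25) / 0.5))


tree_degrees = [0.2, 0.9, 0.5]   # hypothetical degrees of "X is tall" for three trees
as_all = owa(tree_degrees, for_all)
as_some = owa(tree_degrees, exists)
as_many = owa(tree_degrees, many)
```

The two crisp quantifiers recover the extremes mentioned in the text, and any regular increasing quantifier in between produces a value bounded by them.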

By using these modules, the overall system works as follows. First the query 
to be solved is translated to canonical form. Then every atomic subquery is 
contextualised and solved (that is, the degree to which every candidate solution 
fulfils the atomic subquery is calculated). Afterwards the compound subqueries 
(aggregations, quantifications) are contextualised and solved following a bottom- 
up order through the hierarchical structure of the canonical form. For example, 
let the flexible query be: I am looking for a park which contains a very tall tree 
and many short trees. The query is first translated to the following sentence in
canonical form:

[action: searching for,
  [object: park, quantity: 1, restrictions:
    [has:
      [composition: and
        [object: tree, quantity: 1, restrictions: [height: very tall]]
        [object: tree, quantity: many, restrictions: [height: short]]
      ]
    ]
  ]
]



Then, for each case, every sub-query is solved, and after that the aggregation 
(and in the example) is calculated. In addition, we show a list of queries, in order 
to illustrate the type of queries the system is able to work with: I am looking for 
a tall tree; I am looking for a very tall tree; I am looking for a park containing 
a very tall tree; I am looking for a tall tree which is near a library; I am looking 
for a tall tree which is near a library and contains a nest; I am looking for a 
park which contains a tall tree or a tiny Wally; I am looking for a park which 
contains many trees. 
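The bottom-up evaluation of the example query can be sketched in Python as follows. This is a simplified illustration under several assumptions, not the Prolog solver itself: the query is stored as nested dictionaries, tree heights are already scaled to [0, 1], quantity 1 is read as "at least one" (maximum), many is evaluated with an OWA operator, and the membership functions are invented. A park without trees is not handled.

```python
def ramp(x, a, b):
    return min(1.0, max(0.0, (x - a) / (b - a)))


# Hypothetical dictionaries.
VALUES = {"tall": lambda x: ramp(x, 0.25, 0.75),
          "short": lambda x: 1.0 - ramp(x, 0.25, 0.75)}
HEDGES = {"very": lambda mu: mu ** 2}


def many(r):
    return ramp(r, 0.25, 0.75)


def term_degree(term, x):
    """Degree of a term such as 'very tall' for a scaled value x."""
    *hedges, value = term.split()
    mu = VALUES[value](x)
    for hedge in reversed(hedges):
        mu = HEDGES[hedge](mu)
    return mu


def owa(degrees, quantifier):
    ordered = sorted(degrees, reverse=True)
    n = len(ordered)
    return sum((quantifier(i / n) - quantifier((i - 1) / n)) * d
               for i, d in enumerate(ordered, start=1))


def solve(node, heights):
    """Evaluate a query node bottom-up against the scaled tree heights of one park."""
    if "composition" in node:
        parts = [solve(child, heights) for child in node["parts"]]
        return min(parts) if node["composition"] == "and" else max(parts)
    degrees = [term_degree(node["height"], h) for h in heights]
    if node["quantity"] == "many":
        return owa(degrees, many)
    return max(degrees)  # quantity 1: at least one object satisfies the restriction


query = {"composition": "and", "parts": [
    {"object": "tree", "quantity": 1, "height": "very tall"},
    {"object": "tree", "quantity": "many", "height": "short"},
]}

park_a = [1.0, 0.0, 0.0, 0.0]  # one very tall tree and three short ones
park_b = [1.0, 1.0, 1.0, 0.0]  # mostly tall trees
score_a = solve(query, park_a)
score_b = solve(query, park_b)
```

Under these assumptions the first park fully satisfies the query, while the second is rejected because it does not contain many short trees.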

4 Implementation and Preliminary Results 

We have developed and successfully tested the proposed system. In particular,
we have implemented the fuzzy dictionaries and fuzzy query solver in SWI-Prolog
[1] (a free Prolog system). In addition we have developed the complete
architecture shown in figure 2.

Users access VRML virtual environments through a web browser containing
a VRML plug-in (in particular we have used Cortona for our tests). The worlds
contain a VRML JavaScript node which controls the user location and point of
view. Furthermore, an EAI Java applet deals with the network communications
and user interface.

The queries the user enters in the user interface are sent to the server where
they are translated to canonical form and processed by the fuzzy query solver.
The answer generated by this engine (a list of possible destinations where the
objects/zones searched could be located, along with a numerical evaluation of
every one) is sent back to the Java applet in the user’s web browser. This list
is then shown on the user’s interface so that she can navigate the world toward
her desired target just by clicking on the preferred destination.

Fig. 2. Implemented architecture

In order to use our system, only a few steps should be followed by the VRML 
world creators: install the database, save annotations about their worlds in the 
database, install the Prolog fuzzy query engine on the server where their web
server is running, and add a couple of lines of code to every VRML file so that 
it is controllable by the Java program. 

So far we have tested the system with two different kinds of worlds. First
we used the system to query a virtual world which represents the actual
Center of Environmental Education CEMACAM of the Mediterranean Savings
Bank (CAM), composed of five buildings (see figure 3a). This world is a good
example of the kind of virtual world which emulates reality. In this world queries
like: I am searching for a room with many computers, I am looking for a small room
with a big door, etc. are properly solved.




Fig. 3. Test worlds 

Next we developed a Java application which automatically generates huge
toy virtual worlds composed of several gardens with objects of different types
and sizes. By using this application we generated several worlds which allow us
to test the system in deliberately weird fictitious worlds (see figure 3b). In
these worlds queries like: I am searching for a garden with many tall trees and a very
big monument, I am looking for a very small tree near a very tall tree, etc. are
properly solved.

The tests carried out so far show that the proposed system is able to cope with 
the problem of querying in virtual environments. It seems that users feel more 
confident in the worlds when the querying system is available. At the moment
we are testing the system with inexperienced users, to evaluate how useful
the system is.

5 Conclusions and Future Work 

In this paper we have described a proposal for a new virtual worlds representa- 
tion model that requires just a few additional efforts from the worlds creators, 
and adds a basic semantic level to the worlds which is useful to improve the 
interaction of the users with these worlds. We also have described a querying 
model that allows users to find worlds and objects in these worlds, using as a 
base the proposed representation, and a fuzzy approach to solve the queries. 
Both proposed models taken together improve the current interaction with
virtual worlds. We have developed and successfully tested the system. Further
work will be carried out in order to test the system with inexperienced users
and extend the fuzzy dictionaries (to allow more kinds of queries).

Abstract. This work presents the development of an automatic recognizer of
infant cry, with the objective of classifying three kinds of cry: normal,
hypoacoustic and asphyxia. We use acoustic feature extraction
techniques such as LPC and MFCC for the acoustic processing of the cry's sound
wave, and a Feed Forward Input Delay neural network with training based on
Gradient Descent with Adaptive Learning Rate Back-Propagation. We describe the whole
process, and we also show the results of some experiments, in which we obtain
up to 98.67% precision.

Keywords: Infant's Cry, Classification, Pattern Recognition, Neural Networks, 
Acoustic Characteristics. 



1 Introduction 

Pathological diseases in infants are commonly detected several months, often
years, after the infant is born. If any of these diseases had been detected
earlier, they could have been treated and perhaps avoided by the timely
application of treatments and therapies. It has been found that the infant's cry
carries much information in its sound wave. For small infants it is a form of
communication, a very limited one, but similar to the way an adult communicates.
Based on the information contained in the cry's wave, it is possible to determine the
infant's physical state, and even to detect physical pathologies, mainly of the brain, at
very early stages. The initial hypothesis for this project is that if this kind
of relevant information exists in the cry of an infant, then its extraction, recognition and
classification can be carried out through automatic means. In this
work we present the design of a system that classifies different kinds of cries. These
cries are recordings of normal, deaf and asphyxiating infants, aged from one day up
to one year. In the model presented here, we classify the original input vectors,
without reduction, into three corresponding classes: normal, hypoacoustic (deaf) and
asphyxiating cries.



R. Monroy et al. (Eds.): MICAI 2004, LNAI 2972, pp. 69-78, 2004. 
© Springer-Verlag Berlin Heidelberg 2004 







O.F. Reyes Galaviz and C.A. Reyes Garcia 



2 State of the Art 

Although there are not many research works on the automatic recognition of
infant cry, some recent advances show interesting
results and emphasize the importance of doing research in this field. Using
classification methodologies based on Self-Organizing Maps, Cano et al. [2] report
some experiments to classify cry units from normal and pathological infants. Petroni
used neural networks [3] to differentiate between pain and no-pain cries. Taco Ekkel
[4] tried to classify the sound of newborn cries into categories called normal and abnormal
(hypoxia), and reports a correct classification rate of around 85% based on a
radial basis function neural network. In [5] Reyes and Orozco classify cry samples from deaf
and normal infants, obtaining recognition results that go from 79.05% up to 97.43%.



3 The Infant Cry Automatic Recognition Process 



The infant cry automatic classification process (Fig. 1) is basically a pattern 
recognition problem, similar to Automatic Speech Recognition (ASR). The goal is to 
take the wave from the infant's cry as the input pattern, and at the end obtain the kind 
of cry or pathology detected on the baby. Generally, the process of Automatic Cry 
Recognition is done in two steps. The first step is known as signal processing, or 
feature extraction, whereas the second is known as pattern classification. In the 
acoustical analysis phase, the cry signal is first normalized and cleaned, then it is
analyzed to extract the most important characteristics as a function of time. The set of
obtained characteristics can be represented as a vector, and each vector can be taken
as a pattern. The feature vector is compared with the knowledge that the computer
has in order to obtain the classified output.





Fig. 1. Infant Cry Automatic Recognition Process (output: type of cry or pathology detected)



4 Acoustic Processing 

The acoustic analysis implies the selection and application of normalization and
filtering techniques, segmentation of the signal, feature extraction, and data
compression. With the application of the selected techniques we try to describe the
signal in terms of some of its fundamental components. A cry signal is complex and
encodes more information than needs to be analyzed and processed in real
time applications. For that reason, in our cry recognition system we use a feature extraction
function as a front-end processor. Its input is a cry signal, and its output is a vector of
features that characterizes key elements of the cry's sound wave. All the vectors
obtained this way are later fed to a recognition model, first to train it, and later to
classify the type of cry. We have been experimenting with diverse types of acoustic
characteristics, of which the Mel Frequency Cepstral Coefficients
and the Linear Prediction Coefficients stand out for their utility.



4.1 MFCC (Mel Frequency Cepstral Coefficients) 

The low-order cepstral coefficients are sensitive to overall spectral slope and the
high-order cepstral coefficients are susceptible to noise. This property of the speech
spectrum is captured by the Mel spectrum. The Mel spectrum operates on the basis of
selective weighting of the frequencies in the power spectrum. Higher frequencies
are weighted on a logarithmic scale, whereas lower frequencies are weighted on a
linear scale. The Mel scale filter bank is a series of L triangular band pass filters that
have been designed to simulate the band pass filtering believed to occur in the
auditory system. This corresponds to a series of band pass filters with constant
bandwidth and spacing on a Mel frequency scale. On a linear frequency scale, this
spacing is approximately linear up to 1 kHz and logarithmic at higher frequencies.
Most recognition systems are based on the MFCC technique and its first and
second order derivatives. The derivatives are normally approximated by fitting
a linear regression line over an adjustable-size segment of consecutive
frames. The time resolution and the smoothness of the estimated
derivative depend on the size of the segment [6].
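Two of the ideas above can be illustrated with a short Python sketch: the warping of frequency onto a Mel scale and the regression-based estimate of the derivative of a coefficient over time. Both are assumptions made for the example rather than the formulas used by the extraction tool: the Mel mapping is one commonly used analytic form, and the derivative uses the usual least-squares slope over a window of 2N+1 frames with edge frames repeated.

```python
import math


def hz_to_mel(f_hz):
    """A commonly used analytic approximation of the Mel scale (assumed here)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def delta(frames, half_width=2):
    """Regression-based derivative of one coefficient across frames.

    d_t = sum_{n=1..N} n * (c_{t+n} - c_{t-n}) / (2 * sum_{n=1..N} n^2),
    where N = half_width; frames beyond the edges repeat the edge value.
    """
    denom = 2 * sum(n * n for n in range(1, half_width + 1))
    last = len(frames) - 1
    result = []
    for t in range(len(frames)):
        acc = 0.0
        for n in range(1, half_width + 1):
            ahead = frames[min(t + n, last)]
            behind = frames[max(t - n, 0)]
            acc += n * (ahead - behind)
        result.append(acc / denom)
    return result


# A coefficient that grows by one unit per frame has slope 1 away from the edges.
slopes = delta([0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
```

A larger `half_width` gives a smoother but less time-resolved estimate, which is the trade-off mentioned in the text. Applying `delta` to its own output gives the second order derivative.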



4.2 LPC (Linear Prediction Coefficients) 

Linear Predictive Coding (LPC) is one of the most powerful speech analysis 
techniques, and one of the most useful methods for encoding good quality speech at a 
low bit rate. It provides extremely accurate estimates of speech parameters, and is 
relatively efficient for computation. Based on these reasons, we are using LPC to 
represent the crying signals. Linear prediction is a mathematical operation where
future values of a digital signal are estimated as a linear function of previous samples.
In digital signal processing linear prediction is often called linear predictive coding
(LPC) and can thus be viewed as a subset of filter theory. In system analysis (a
subfield of mathematics), linear prediction can be viewed as a part of mathematical
modeling or optimization [7]. The particular way in which data are segmented
determines whether the covariance method, the autocorrelation method, or any of the
so-called lattice methods of LP analysis is used. The first method that we are using is
the autocorrelation LP technique. As the order of the LP model increases, more details
of the power spectrum of the signal can be approximated. Thus, the spectral envelope
can be efficiently represented by a small number of parameters, in this case the LP
coefficients [5].
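The autocorrelation method can be illustrated with the following Python sketch. It assumes that the normal equations are solved with the Levinson-Durbin recursion, a standard choice that the text does not state explicitly; windowing and pre-emphasis are omitted and the function names are hypothetical.

```python
def autocorrelation(signal, max_lag):
    """r[k] = sum_i signal[i] * signal[i + k] for k = 0..max_lag."""
    n = len(signal)
    return [sum(signal[i] * signal[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve the LP normal equations from the autocorrelation sequence r.

    Returns (a, error), where the prediction of sample x[n] is
    sum_{k=1..order} a[k-1] * x[n-k] and error is the residual energy.
    """
    a = [0.0] * order
    error = r[0]
    for i in range(order):
        # Reflection coefficient for model order i + 1.
        acc = r[i + 1] - sum(a[j] * r[i - j] for j in range(i))
        k = acc / error
        updated = a[:]
        updated[i] = k
        for j in range(i):
            updated[j] = a[j] - k * a[i - 1 - j]
        a = updated
        error *= (1.0 - k * k)
    return a, error


# A decaying exponential x[n] = 0.5^n is predicted by x[n] ~ 0.5 * x[n-1].
frame = [0.5 ** n for n in range(50)]
coeffs, residual = levinson_durbin(autocorrelation(frame, 2), order=2)
```

For a real cry frame the same two calls would be applied to each windowed segment, with the order set to the number of coefficients wanted per frame.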

5 Cry Patterns Classification

The set of acoustic characteristics obtained in the extraction stage is generally
represented as a vector, and each vector can be taken as a pattern. These vectors are
later used in the classification process. There are four basic schools for the
solution of the pattern classification problem: a) pattern comparison
(dynamic programming), b) statistical models (Hidden Markov Models, HMM), c)
knowledge-based systems (expert systems) and d) connectionist models (neural
networks). For the development of the present work we used the connectionist
type of model, known as neural networks. We have selected this kind of model, in
principle, because of its adaptation and learning capacity. Besides, one of its main
functions is pattern recognition; models of this kind are still under constant
experimentation, but their results have been very satisfactory.



5.1 Neural Networks 

In a DARPA study [9] neural networks were defined as systems composed of
many simple processing elements that operate in parallel and whose function is
determined by the network's structure, the strength of its connections, and the
processing carried out by the processing elements or nodes. We can train a neural
network to perform a particular function by adjusting the values of the connections
(weights) between the elements. Generally, neural networks are adjusted or
trained so that a particular input leads to a specified or desired output. Neural
networks have been trained to perform complex functions in many application areas
including pattern recognition, identification, classification, speech, vision, and control
systems. Nowadays neural networks can be trained to solve problems that are hard
to solve with conventional methods. There are many kinds of learning and design
techniques that multiply the options that a user can take [10]. In general, the training
can be supervised or unsupervised. Supervised training methods are the
most commonly used when labeled samples are available. Among the most
popular models are the feed-forward neural networks, trained under supervision
with the back-propagation algorithm. For the present work we have used variations of
this basic model, which are briefly described below.



5.2 Feed Forward Input (Time) Delay Neural Network 

Cry data are not static, and any cry sample at any instant in time is dependent on
crying patterns before and after that instant. A common flaw in the
traditional Back-Propagation algorithm is that it does not take this into account.
Waibel et al. set out to remedy this problem in [11] by proposing a new network
architecture called the "Time-Delay Neural Network" or TDNN. The primary feature
of TDNNs is the time-delayed inputs to the nodes. Each time delay is connected to the
node via its own weight, and represents input values at past instants in time. TDNNs
are also known as Input Delay Neural Networks because the inputs to the neural
network are the ones delayed in time. If we delay the input signal by one time unit
and let the network receive both the original and the delayed signals, we have a
simple time-delay neural network. Of course, we can build a more complicated one by
delaying the signal at various lengths. If the input signal is n bits and delayed for m
different lengths, then there should be nm input units to encode the total input. When
new information arrives, it is placed in nodes at one end and old information shifts
down a series of nodes like a shift register controlled by a clock. A general
architecture of time-delay networks is drawn in Figure 2 [12].

Infant Cry Classification to Identify Hypoacoustics and Asphyxia

Fig. 2. A time delay neural network whose input contains a number of tapped delay lines.

The Feed-forward Input delay neural network consists of N1 layers that use the dot
product weight function, which applies weights to an input to
obtain weighted inputs. The first layer has weights that come from the input with
the input delay specified by the user; in this case the delay is [0 1]. Each subsequent
layer has a weight that comes from the previous layer. The last layer is the output of the
network. The adaptation is done by means of any training algorithm. The performance
is measured according to a specified performance function [10]. Some of the most
notable properties of TDNNs are: i) The network is shift-invariant: a pattern may
be correctly recognized and classified regardless of its temporal location. ii) The
network is not sensitive to phoneme boundary misalignment: the TDNN is not only
able to learn from badly aligned training data, it is even able to correct the alignment.
It does this by learning where the phoneme's presence is significant within the
segment of speech. This property is later used to perform recursive sample
re-labeling. iii) The network requires small training sets: in [13] Tebelskis quotes the
findings of several papers that indicate that the TDNN, when exposed to time-shifted
inputs with constrained weights, can learn and generalize well even with limited
amounts of training data.
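The tapped delay line at the input of such a network can be sketched in a few lines of Python. The function name is hypothetical, and filling the positions before the start of the sequence with zeros is an assumption about the initial state of the delay line.

```python
def tapped_delay_inputs(sequence, delays=(0, 1)):
    """Build the network input for every time step.

    For each step t the feature vectors at t - d, for every d in `delays`,
    are concatenated. With n features and m delays each row has n * m values.
    Steps before the start of the sequence are filled with zeros (assumption).
    """
    width = len(sequence[0])
    zeros = [0.0] * width
    rows = []
    for t in range(len(sequence)):
        row = []
        for d in delays:
            row.extend(sequence[t - d] if t - d >= 0 else zeros)
        rows.append(row)
    return rows


# Three time steps with n = 2 features each; delays [0 1] give m = 2.
steps = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
rows = tapped_delay_inputs(steps)
```

Each row places the current vector next to the previous one, which is the shift-register behaviour described above.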

5.3 Training by Gradient Descent with Adaptive Learning Rate Backpropagation

The training by gradient descent with adaptive learning rate backpropagation, 
proposed for this project, can train any network as long as its weight, net input, and 
transfer functions have derivative functions. Back-propagation is used to calculate 
derivatives of performance with respect to the weight and bias variables. Each 
variable is adjusted according to gradient descent. At each training epoch, if the 
performance decreases toward the goal, then the learning rate is increased. If the 
performance increases, the learning rate is adjusted by a decremental factor and the 
change, which increased the performance, is not made [10]. Several adaptive learning
rate algorithms have been proposed to accelerate the training procedure. The
following strategies are usually suggested: i) start with a small learning rate and
increase it exponentially if successive epochs reduce the error, or rapidly decrease it
if a significant error increase occurs; ii) start with a small learning rate and increase it
if successive epochs keep the gradient direction fairly constant, or rapidly decrease it if
the direction of the gradient varies greatly at each epoch; and iii) give each weight an
individual learning rate, which increases if the successive changes in the
weights are in the same direction and decreases otherwise. Note that all the above
mentioned strategies employ heuristic parameters in an attempt to enforce the
monotone decrease of the learning error and to secure the convergence of the training
algorithm [14].
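The adaptive learning rate rule described in this section can be sketched on a toy problem. The Python code below is not the toolbox routine: the increase and decrease factors, the initial learning rate and the error goal are illustrative values, and the gradient is supplied analytically instead of being obtained by back-propagation.

```python
def train_adaptive(loss, grad, weights, lr=0.01, lr_inc=1.05, lr_dec=0.7,
                   max_epochs=2000, goal=1e-6):
    """Gradient descent with an adaptive learning rate.

    A step that lowers the error is kept and the learning rate grows;
    a step that raises the error is discarded and the learning rate shrinks.
    """
    perf = loss(weights)
    epochs = 0
    while epochs < max_epochs and perf > goal:
        gradient = grad(weights)
        candidate = [w - lr * g for w, g in zip(weights, gradient)]
        new_perf = loss(candidate)
        if new_perf < perf:
            weights, perf = candidate, new_perf
            lr *= lr_inc
        else:
            lr *= lr_dec      # the change that increased the error is not made
        epochs += 1
    return weights, perf, epochs


# Toy error surface: sum of squares, with its minimum at the origin.
def toy_loss(w):
    return sum(x * x for x in w)


def toy_grad(w):
    return [2.0 * x for x in w]


final_weights, final_perf, used_epochs = train_adaptive(toy_loss, toy_grad, [3.0, -2.0])
```

The two stopping conditions mirror the ones used in the experiments: a maximum number of epochs or a target error, whichever is reached first.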



6 System Implementation for the Crying Classification 

First, the infant cries are collected as recordings obtained directly from
doctors of the Instituto Nacional de la Comunicacion Humana (National Institute of
Human Communication) and IMSS Puebla. This is done using SONY ICD-67 digital
recorders. The cries are captured and labeled in the computer with the kind of
cry that the collector states orally at the end of each recording. Later, each signal
wave is divided into segments of 1 second; these segments are labeled with a
pre-established code, and each one constitutes a sample. For the present experiments we
have a corpus made up of 1049 samples of normal infant cry, 879 of hypoacoustic cry,
and 340 of cry with asphyxia. In the following step the samples are processed one by one,
extracting their acoustic characteristics, LPC and MFCC, with the freeware
program Praat [1]. The acoustic characteristics are extracted as follows: for every
second we extract 16 coefficients from each 50-millisecond frame, generating vectors
with 304 coefficients per sample. The neural network and the training algorithm are
implemented with Matlab's Neural Network Toolbox. The neural network's
architecture consists of 304 neurons in the input layer, a hidden layer with 120
neurons, and one output layer with 3 neurons. The delay used is [0 1]. In order to
carry out the training and recognition tests, we randomly select 340 samples from each class.
This number is determined by the number of asphyxiating cry samples available. From
them, 290 samples of each class are randomly selected for training in one experiment,
and 250 in another. With these vectors the network is trained. The training is
made until 2000 epochs have been completed or an 1x10 '’ error has been reached. 
After the network is trained, we test it with the remaining 50 or 90 samples of each
class, set apart from the original 340 samples. The recognition accuracy percentage
from each experiment is presented in a confusion matrix.
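The sample selection described above can be sketched as follows. The Python code is only an illustration: integer identifiers stand in for the recorded samples, the fixed seed and the function name are assumptions, and the figures (1049, 879 and 340 samples; 340 per class; 290 for training) are the ones given in the text. Note that 304 coefficients at 16 coefficients per frame correspond to 19 frames per one-second sample.

```python
import random

COEFFS_PER_FRAME = 16
VECTOR_LENGTH = 304
FRAMES_PER_SAMPLE = VECTOR_LENGTH // COEFFS_PER_FRAME   # 19 frames per sample


def split_per_class(samples_by_class, per_class=340, n_train=290, seed=0):
    """Randomly pick `per_class` samples of each class, then split train/test."""
    rng = random.Random(seed)
    train, test = {}, {}
    for label, samples in samples_by_class.items():
        chosen = rng.sample(samples, per_class)
        train[label] = chosen[:n_train]
        test[label] = chosen[n_train:]
    return train, test


# Sample identifiers stand in for the feature vectors of the corpus.
corpus = {
    "normal": list(range(1049)),
    "hypoacoustic": list(range(879)),
    "asphyxia": list(range(340)),
}
train, test = split_per_class(corpus)                         # 290 / 50 experiment
train_250, test_90 = split_per_class(corpus, n_train=250)     # 250 / 90 experiment
```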



7 Experimental Results 

The classification accuracy was calculated by taking the number of samples correctly 
classified, divided by the total number of samples. The detailed results of two tests of 
each kind of acoustic characteristic used, LPC and MFCC, with samples of 1 second, 
with 16 coefficients for every 50 ms frame, are shown in the following confusion 
matrices. 
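
The accuracy computation described above can be sketched in a few lines of Python.
The matrix entries below are invented for illustration and are not the authors'
results; the only figure taken from the text is the test-set size of 3 x 50 = 150
samples. With 148 of 150 samples on the diagonal the formula gives 98.67%.

```python
def accuracy(confusion):
    """Correctly classified samples (diagonal) divided by all samples."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total


# Hypothetical 3-class confusion matrix: rows are true classes,
# columns are predicted classes, 50 test samples per class.
example = [
    [50, 0, 0],
    [1, 49, 0],
    [0, 1, 49],
]

print(f"{100 * accuracy(example):.2f}%")  # prints "98.67%"
```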

Results using 290 samples to train and 50 samples to test the neural network with 
LPC. 



Table 1. Confusion matrix showing a 93.33% precision after 2000 training epochs and an error 
of IxlOT 
Results using 250 samples to train and 90 samples to test the neural network with
LPC. 

Table 2 . Confusion matrix showing a 92.96% precision after 2000 training epochs and an error 
of IxlOL 




Fig. 3. Training with 250 samples and LPC feature vectors

Fig. 4. Training with 290 samples and MFCC feature vectors

Results using 290 samples to train and 50 samples to test the neural network with 
MFCC. 

Table 3. Confusion matrix showing a 98.67% precision with an error convergence of 1x10* 
after only 937 training epochs. 
Results using 250 samples to train and 90 samples to test the neural network with
MFCC. 



Table 4. Confusion matrix showing a 96.30% precision with an error convergence of 1x10 * 
after only 619 training epochs. 
7.1 Results Analysis

As we can see from Figure 3 and Figure 4, the training of the neural network with
LPC is slower than the one done with MFCC. With LPC features, the neural network
stops only when it reaches the 2000 epochs we defined, but the error only goes down to
1x10'^. On the other hand, with MFCC, the network converges when it reaches the 
defined error, that is 1x10'*, and after the training has reached only about 950 epochs, 
in the case of training with 290 samples, and 619 epochs for the 250 samples case. As
can also be noticed, the results for both types of features are slightly lower when
the network is trained with 250 samples. Our interpretation is that the
classification accuracy improves with a larger number of training samples. This
observation suggests that we should focus on collecting more cry samples, mainly of
the asphyxia class, which is the one that currently limits the size of the training
set. We should not disregard the fact that training with LPC features gave good
results; the drawbacks were that training was slower, the error was higher, and the
classification accuracy was lower than with the MFCC features.



8 Conclusions and Future Work 

This work demonstrates the efficiency of the feed-forward input (time) delay neural
network, particularly when using the Mel Frequency Cepstral Coefficients. It also
shows that the results obtained, of up to 98.67%, are slightly better than those
obtained in the previous works mentioned. These results may also be related to
the fact that we use the original size vectors, with the objective of preserving all
useful information. In order to compare the obtained performance results, and to
reduce the computational cost, we plan to try the system with an input vector
reduction algorithm based on evolutionary computation. The purpose is to train the
network in a shorter time without decreasing accuracy. We are also still
collecting well-identified samples of the three kinds of cries, in order to ensure a
more robust training. Among the works in progress in this project, we are testing
new neural networks and new kinds of hybrid models that combine neural networks
with genetic algorithms and fuzzy logic, or other complementary models.
Abstract. This paper demonstrates the usefulness of syntactic trigrams
in improving the performance of a speech recognizer for the Spanish language.
This technique is applied as a post-processing stage that uses
syntactic information to rescore the N-best hypothesis list in order to
increase the score of the most syntactically correct hypothesis. The basic
idea is to build a syntactic model from training data, capturing syntactic
dependencies between adjacent words in a probabilistic way, rather
than resorting to the use of a rule-based system. Syntactic trigrams
are used because of their power to express relevant statistics about the
short-distance syntactic relationships between the words of a whole sentence.
For this work we used a standardized tagging scheme known as the
EAGLES tag definition, because of its ease of use and its broad coverage of
all grammatical classes for Spanish. The relative improvement for the speech
recognizer is 5.16%, which is statistically significant at the 10% level,
for a task of 22,398 words (HUB-4 Spanish Broadcast News).



1 Introduction 

Automatic speech recognition (ASR) has progressed substantially over the past 
fifty years. While high recognition accuracy can be obtained even for continuous 
speech recognition, word accuracy deteriorates when speech recognition systems 
are used in adverse conditions. Therefore, new ways to tackle this problem must 
be tried. 

Since speech recognition is a very difficult task, language knowledge has been
successfully used for many years to improve recognition accuracy. Human beings
can make use of language information to predict what a person is going to say,
and in adverse environments language information enables two people to follow
a dialog. Similarly, language knowledge can be used to clean up the outputs of
speech recognition systems.

The use of syntactic constraints is potentially especially valuable for
languages like Spanish, where there are many semantically appropriate but
acoustically confusable words, such as the masculine and feminine forms of nouns
and adjectives and the various conjugations of verbs.

This paper describes the use of a language model rescoring procedure
(Fig. 1). Our system uses an N-best list to generate several potentially correct
hypotheses. A postprocessing stage analyzes each entry on the N-best list and
rescores it according to linguistic criteria, in our case syntactic co-occurrence.
Finally, the top hypothesis in the rescored list is selected as the new correct
hypothesis.




[Figure 1: the decoder produces the original N-best list, each entry with its
acoustic score: “EL MUNDO EL ESPECTACULO”, “EL MUNDO DE ESPECTACULO”,
“EL MUNDO DEL ESPECTACULO”, “EL MUNDO DE EL ESPECTACULO”. The language model
rescorer outputs the rescored N-best list, each entry with its final score:
“EL MUNDO DEL ESPECTACULO”, “EL MUNDO DE EL ESPECTACULO”,
“EL MUNDO DE ESPECTACULO”, “EL MUNDO EL ESPECTACULO”.]

Fig. 1. Rescoring and sorting the N-best list.



In continuous speech recognition, the talker rarely uses grammatically
well-constructed sentences, so attempts to use a formal language are not feasible.
Probabilistic language models, on the other hand, have been very effective. Language
modelling deals with the problem of predicting the next word in an utterance,
given a previously known history of words. The language model definition is as
follows:

P(W) = \prod_{i=1}^{n} P(w_i \mid w_1, w_2, \ldots, w_{i-1})    (1)

where W represents a sequence of words w_i.

The particular case in which the history is limited to the two previous words,
which corresponds to the trigram language model (n-gram order 3), has been found
to be very powerful even though its scope is limited to a short distance. The
success and simplicity of trigrams is limited by only two difficulties, local span
and sparseness of data [1].

Because of the way word trigrams build the language model (computing
statistics for triplets of adjacent words), they are unable to capture long-distance
dependencies. Data sparseness is also a very serious problem: unseen triplets
lead to zero probability during recognition even when the acoustic probability
is high [2]. Several approaches to smoothing have been proposed to ameliorate
this problem. While these techniques do compensate for unseen trigrams, they
may also favor incorrect grammatical constructions. Many other alternatives for
language modelling have also been tried, including the successful exponential
language models [3,4], but their computational complexity is higher.
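
The sparseness problem can be made concrete with a small sketch. The code below
estimates word trigram probabilities by relative frequency (an unsmoothed
maximum-likelihood estimate, chosen here only for illustration) from a two-sentence
toy corpus. The corpus, the function names, and the word "cine" are assumptions of
the example and do not come from the experiments.

```python
from collections import Counter


def train_trigrams(sentences):
    """Count word trigrams and the two-word histories that precede a word."""
    tri, hist = Counter(), Counter()
    for words in sentences:
        for i in range(2, len(words)):
            tri[tuple(words[i - 2:i + 1])] += 1
            hist[tuple(words[i - 2:i])] += 1
    return tri, hist


def trigram_prob(tri, hist, w1, w2, w3):
    """Unsmoothed estimate of P(w3 | w1, w2); 0.0 for unseen events."""
    seen_history = hist[(w1, w2)]
    return tri[(w1, w2, w3)] / seen_history if seen_history else 0.0


toy_corpus = [
    "el mundo del espectaculo".split(),
    "el mundo del deporte".split(),
]
tri, hist = train_trigrams(toy_corpus)

p_seen = trigram_prob(tri, hist, "mundo", "del", "deporte")  # seen once in two
p_unseen = trigram_prob(tri, hist, "mundo", "del", "cine")   # never seen
print(p_seen, p_unseen)  # prints "0.5 0.0"
```

A perfectly plausible continuation receives probability zero simply because it did
not occur in training, which is the situation that smoothing, or a model over a
much smaller tag inventory, is meant to relieve.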

In this work we build a language model based not only on local word depen- 
dencies but also on the syntactic structure of the history h of previous words. 
This method also reduces the data sparseness problem, since during decoding 
we make use of correct syntactic structures derived from training, even when 
specific words were unseen in the training corpus. 



2 Syntactic Trigrams for Prediction of a Word’s Class 

Inspired by the success of word trigrams, and knowing that each word has syntactic
properties (or tags, see appendix A), we propose to complement the concept
of word trigrams with syntactic trigrams (tag trigrams), which are based on the
words' attributes rather than the words themselves. Basically, a syntactic trigram
is the sequence of syntactic tags that describe three adjoining words. In this
way, the syntactic trigrams capture the short-distance grammatical rules used in
a training corpus.

In order to use the syntactic trigrams we need a tagged corpus for training 
the language model. During training we count each syntactic-tag trigram, using 
these numbers to develop probabilities that are invoked during a postprocessing 
stage of decoding. 

Let the syntactic tags of a whole sentence W of length m words
(W = w_1, w_2, \ldots, w_m) be denoted as T = t_1, t_2, \ldots, t_m. The overall
correctness score for a whole sentence is computed by multiplying the syntactic
trigram probabilities in a chain, according to the expression:

P_{synt-3g}(W) = \prod_{i=3}^{m} P(t_i \mid t_{i-1}, t_{i-2})    (2)

For a whole sentence, the syntactic score is combined with the acoustical and
conventional language models as follows:

\hat{W} = \arg\max_{W} \underbrace{P(A \mid W)}_{\text{Acoustic Score}} \, \underbrace{P(W)^{LW} \, IP^{m}}_{\text{Language Model Score}} \, \left( P_{synt-3g}(W) \right)^{SW}    (3)

where IP and LW are the word-based language model parameters [5] (the insertion
penalty and language weight, respectively) and SW is a syntactic weight
that determines the effect of syntactic trigram rescoring on the N-best list. As
is seen, the acoustic and word trigram language model scores are augmented
(rather than replaced) by the syntactic trigram score, with the goal of improving
the performance of the recognizer by implicitly encoding grammatical rules
from the training corpus, and hopefully improving the score of the correct
hypothesis on the N-best list.
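
A log-domain sketch of this score combination is given below. It assumes the
combined score is the product of the acoustic score, the word language model score
raised to LW, an insertion penalty applied once per word, and the syntactic score
raised to SW; the exact placement of the insertion penalty is an assumption based
on the usual formulation in [5]. All scores in the toy N-best list, the dictionary
keys, and the value ip=0.7 are invented for illustration; only LW = 9.5 and
SW = 0.85 appear in the experiments.

```python
import math


def combined_log_score(hyp, lw=9.5, sw=0.85, ip=0.7):
    """Sum of log scores: acoustic + LW*word LM + words*log(IP) + SW*syntactic."""
    n_words = len(hyp["words"])
    return (hyp["log_acoustic"]
            + lw * hyp["log_word_lm"]
            + n_words * math.log(ip)
            + sw * hyp["log_syntactic"])


def rescore(nbest, **weights):
    """Return the hypotheses sorted from best to worst combined score."""
    return sorted(nbest, key=lambda h: combined_log_score(h, **weights),
                  reverse=True)


# Hypothetical two-entry N-best list (natural-log scores, made up).
nbest = [
    {"words": "el mundo el espectaculo".split(),
     "log_acoustic": -100.0, "log_word_lm": -12.0, "log_syntactic": -9.0},
    {"words": "el mundo del espectaculo".split(),
     "log_acoustic": -100.5, "log_word_lm": -12.0, "log_syntactic": -4.0},
]

baseline_top = rescore(nbest, sw=0.0)[0]   # syntactic score switched off
rescored_top = rescore(nbest, sw=0.85)[0]  # syntactic score switched on
print(" ".join(baseline_top["words"]))
print(" ".join(rescored_top["words"]))
```

In this toy case the slightly better acoustic score wins when SW is zero, and the
syntactically more likely hypothesis moves to the top once SW is 0.85.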

In addition to word error rate (WER) we considered two additional metrics
to assess the effectiveness of the syntactic trigrams: depth reduction and top score
total. The first of these metrics is the average reduction in depth of the
correct hypothesis over all sentences, counted even when the hypothesis is not
promoted to the top position of the N-best list after rescoring, while the latter
is the number of correct hypotheses that are raised to the top position. These
numbers are computed as follows:

\text{depth reduction} = \frac{1}{N} \sum_{i=1}^{N} \left[ \text{depth}_{\text{before rescore}}(\text{correct hyp}_i) - \text{depth}_{\text{after rescore}}(\text{correct hyp}_i) \right]    (4)

\text{top score total} = \sum_{i=1}^{N} \varphi(i)    (5)

where N is the total number of sentences in the evaluation set and \varphi(i) is
defined by

\varphi(i) = \begin{cases} 1 & \text{when correct hyp}_i \text{ is in the topmost position of the N-best list,} \\ 0 & \text{otherwise.} \end{cases}
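
The two metrics can be sketched as follows. The sketch assumes that depth 1 denotes
the topmost position, that the reduction is measured as depth before rescoring
minus depth after rescoring (so that a positive value means the correct hypothesis
moved up), and that top score total is a plain count. The four-sentence depth lists
are invented for illustration.

```python
def depth_reduction(depth_before, depth_after):
    """Average of (depth before - depth after) over all sentences."""
    n = len(depth_before)
    return sum(b - a for b, a in zip(depth_before, depth_after)) / n


def top_score_total(depth_after, top=1):
    """Number of sentences whose correct hypothesis ends in the top position."""
    return sum(1 for d in depth_after if d == top)


# Hypothetical positions of the correct hypothesis for four sentences.
before = [3, 1, 5, 2]
after = [1, 1, 2, 3]

print(depth_reduction(before, after))  # (2 + 0 + 3 - 1) / 4 = 1.0
print(top_score_total(after))          # 2
```

The fourth sentence shows why the two numbers are reported together: an individual
hypothesis can move down while the average depth still improves.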



3 Experimental Results 

In our experiments, we work with the CMU SPHINX-III speech recognition
system to generate an N-best list for the HUB-4 Spanish Broadcast News task,
a database with a vocabulary of 22,398 different words (409,927 words in 28,537
sentences), using 74 possible syntactic tags including grammatical type, gender,
and number for each word.

A series of experiments was conducted to evaluate the effectiveness of the
syntactic trigram language model. The first set of tests used a language weight
of 9.5 and served as our baseline [6]. Since we were interested in determining
an optimal value for the syntactic weight parameter, we tried several values;
the range from 0.8 to 3.0 gave the best outcomes. Specifically, the results
showed that the greatest average depth reduction in the N-best list for
the correct hypothesis was observed for syntactic weights of 0.80 and 0.85,
but the maximum number of correct hypotheses that moved to
the top position after rescoring was achieved at 0.85 (Fig. 2).

We repeated the same syntactic weight variations for another set of tests
using a language weight of 10.0. The results were slightly worse, since the
maximum number of correctly rescored hypotheses was lower than in the previous
configuration; however, the average depth reduction was almost the same (Fig. 3).

To complete the evaluation of the syntactic trigrams we rescored the N-best
list generated by the lattice rescorer available in the SPHINX-III system. Using
the best configurations for syntactic weights we obtained an overall WER of the
decoder using the syntactic postprocessing.

Fig. 2. Overall rescoring effectiveness of the syntactic trigrams using a language
weight of 9.5.

Fig. 3. Overall rescoring effectiveness of the syntactic trigrams using a language
weight of 10.0.

Results reported in Table 1 show a relative improvement of 4.64% (1.37%
absolute) and 3.51% (0.93% absolute) using 9.5 and 10.0 as language weights,
respectively. Using a simple validation test [7] we determined that all of these
results are valid at a significance level of 10%. Using our best configuration
(language weight of 10.0 and syntactic weight of 0.85) we note an overall
improvement of 5.16% relative (1.39% absolute) over our baseline system.



Table 1. Effectiveness of the syntactic trigrams.

Experiment      Language Weight   Syntactic Weight   WER
Baseline        9.5               0.00               26.90%
Experiment 1    9.5               0.85               25.65%
Experiment 2    9.5               1.00               25.89%
Experiment 3    10.0              0.00               26.44%
Experiment 4    10.0              0.85               25.51%
Experiment 5    10.0              1.00               25.53%
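
The relative and absolute improvements can be recomputed from the WER values of
Table 1 with the short sketch below. The function name and the pairing of each
experiment with its baseline are assumptions of the example; the WER figures are
those of the table.

```python
def improvement(baseline_wer, new_wer):
    """Return (absolute, relative) WER reduction, both in percent."""
    absolute = baseline_wer - new_wer
    relative = 100.0 * absolute / baseline_wer
    return absolute, relative


# Language weight 9.5: Baseline vs. Experiment 1 (syntactic weight 0.85).
abs_95, rel_95 = improvement(26.90, 25.65)
# Language weight 10.0: Experiment 3 vs. Experiment 4 (syntactic weight 0.85).
abs_100, rel_100 = improvement(26.44, 25.51)
# Best configuration (Experiment 4) against the overall baseline.
abs_best, rel_best = improvement(26.90, 25.51)

for name, a, r in [("LW 9.5", abs_95, rel_95),
                   ("LW 10.0", abs_100, rel_100),
                   ("best vs. baseline", abs_best, rel_best)]:
    print(f"{name}: {a:.2f} absolute, {r:.2f}% relative")
```

With the table's values the relative figures come out at about 4.65%, 3.52% and
5.17%, in line with the 4.64%, 3.51% and 5.16% quoted in the text. The absolute
reduction for language weight 9.5 comes out at 1.25 points, which differs from the
1.37% quoted in the text; the source does not explain the difference.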
4 Discussion

The results presented in the previous section show that the use of syntactic
trigrams leads to a small but significant increase in recognition accuracy. Since
these results taken in isolation may appear to be disappointing, we also tabulated
the number of correct hypotheses that were rescored to the first two or three
positions of the N-best list and compared these hypotheses to the ones ranked
above them. We analyzed several sentences where the mismatch was very
obvious but, even with rescoring, the correct transcription never ended better
than in the second position.

An interesting case to examine is the sentence sv97725b.00159.377, for which
the correct transcription is different from the hypothesis chosen by the rescorer:

Problem: “NUEVO” (Adjective, masculine, singular)

Rescorer’s hypothesis:

(la DA0FS0) (aun RG) (escasa AQ0FS0) (profesionalidad NCFS000) (del SPCMS)
(^ DI0MS0) (cuerpo NCMS000) (de SPS00) (policia NCCS000)

Trigram tags            Trigram probability   Accumulated in sentence
D--FS- R- A--FS-        1.5454e-01            1.5454e-01
R- A--FS- N-FS          1.2432e-01            1.9213e-02
A--FS- N-FS S--MS       6.0818e-02            1.1685e-03
N-FS S--MS D--MS-       1.2601e-02            1.4725e-05
S--MS D--MS- N-MS       4.7826e-01            7.0425e-06
D--MS- N-MS S--00       3.3157e-01            2.3350e-06
N-MS S--00 N-CS         3.2299e-03            7.5422e-09

Correct transcription:

(la DA0FS0) (aun RG) (escasa AQ0FS0) (profesionalidad NCFS000) (del SPCMS)
(nuevo AQ0MS0) (cuerpo NCMS000) (de SPS00) (policia NCCS000)

Trigram tags            Trigram probability   Accumulated in sentence
D--FS- R- A--FS-        1.5454e-01            1.5454e-01
R- A--FS- N-FS          1.2432e-01            1.9213e-02
A--FS- N-FS S--MS       6.0818e-02            1.1685e-03
N-FS S--MS A--MS-       4.0504e-02            4.7331e-05
S--MS A--MS- N-MS       4.7115e-01            2.2300e-05
A--MS- N-MS S--00       3.4303e-01            7.6497e-06
N-MS S--00 N-CS         3.2299e-03            2.4708e-08
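
The "Accumulated in sentence" column is the running product of the trigram
probabilities. The sketch below recomputes it from the probabilities listed for
sentence sv97725b.00159.377; the variable and function names are illustrative. The
final products agree with the tabulated 7.5422e-09 and 2.4708e-08 to about four
significant digits, the small difference presumably coming from the rounding of the
printed probabilities.

```python
from itertools import accumulate
import operator

# Trigram probabilities as printed for sentence sv97725b.00159.377.
rescorer_probs = [1.5454e-01, 1.2432e-01, 6.0818e-02, 1.2601e-02,
                  4.7826e-01, 3.3157e-01, 3.2299e-03]
correct_probs = [1.5454e-01, 1.2432e-01, 6.0818e-02, 4.0504e-02,
                 4.7115e-01, 3.4303e-01, 3.2299e-03]


def accumulated_scores(probs):
    """Running product of the trigram probabilities along the sentence."""
    return list(accumulate(probs, operator.mul))


score_rescorer = accumulated_scores(rescorer_probs)[-1]
score_correct = accumulated_scores(correct_probs)[-1]

print(f"rescorer's hypothesis: {score_rescorer:.4e}")
print(f"correct transcription: {score_correct:.4e}")
print(f"ratio: {score_correct / score_rescorer:.2f}")
```

On the syntactic score alone the correct transcription is therefore a little over
three times more likely than the rescorer's hypothesis.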
From the previous example we can see that the correct transcription got the
highest score using the syntactic trigrams, but later, when combined with the
acoustic and word trigram language model scores, the overall result ranks the
correct transcription at the second position of the rescored N-best list. This
example illustrates both the effective performance of syntactic trigrams and their
dependence on other stages of the decoding process. In this particular case the
contributions of other knowledge sources degrade the score of the correct
hypothesis. This is not observed in general, as there are also some cases where
the acoustic scores indeed provide additional information to disambiguate
between similar grammatical constructions. In the following example (sentence
sv97725b.01337.376), the syntactic trigrams give the two different sentences equal
scores, since their grammatical constituents are equivalent, as should be the case.



Hypothesis from the rescorer: “CREO QUE PODIA DECIR CON TODA SEGURIDAD”

Correct transcription: “CREO QUE PODRIA DECIR CON TODA SEGURIDAD”

Position of the correct transcription in the rescored N-best list:
Problem: “PODIA” (Verb, 1st person, singular)
         “PODRIA” (Verb, 1st person, singular)

Rescorer’s hypothesis:

(creo VMIP1S0) (que CS) (podia VMII1S0) (decir VMN0000) (con SPS00) (toda
DI0FS0) (seguridad NCFS000)

Trigram tags             Trigram probability   Accumulated in sentence
V---1S0 CS V---1S0       6.6420e-02            6.6420e-02
CS V---1S0 V---000       1.2083e-01            8.0258e-03
V---1S0 V---000 S--00    2.3036e-01            1.8488e-03
V---000 S--00 D--FS-     1.9225e-01            3.5545e-04
S--00 D--FS- N-FS        7.7827e-01            2.7663e-04

Correct transcription:

(creo VMIP1S0) (que CS) (podria VMIC1S0) (decir VMN0000) (con SPS00) (toda
DI0FS0) (seguridad NCFS000)

Trigram tags             Trigram probability   Accumulated in sentence
V---1S0 CS V---1S0       6.6420e-02            6.6420e-02
CS V---1S0 V---000       1.2083e-01            8.0258e-03
V---1S0 V---000 S--00    2.3036e-01            1.8488e-03
V---000 S--00 D--FS-     1.9225e-01            3.5545e-04
S--00 D--FS- N-FS        7.7827e-01            2.7663e-04
All the previous experiments were performed using an N-best list of no more
than 200 entries, one list for each sentence in the test set. The lattice rescorer,
however, may create a shorter list for a sentence if the A* search algorithm is
exhausted before the 200 hypotheses have been generated. Using this configuration,
only about 35% of the N-best lists produced after lattice rescoring include
the correct hypothesis. This obviously limits the performance of the syntactic
trigram whole sentence language model.

5 Conclusions 

Our results demonstrate small but statistically significant improvements in the
recognition accuracy of the SPHINX-III decoder, using syntactic trigrams as
a post-processing language model stage. There could still be a chance for an
additional reduction in WER using other configurations of language weight and
syntactic weight, but we should note that even with the optimal selection of
these parameters, the syntactic language model is restricted by the performance
of the lattice rescorer. In general, raising the correct hypothesis to the top of the
list is not possible for syntactic trigrams unless the correct hypothesis is included
in the N-best list by the lattice rescorer. In other words, the effectiveness of the
language model depends on the accuracy of the A* search algorithm employed in
the last stage of decoding. It is possible that additional improvements in overall
performance can be obtained by considering different values of the beam search
parameter used in the algorithm or increasing the number of hypotheses on the
list.

Finally, in most of the cases where the correct hypothesis is reclassified to the
second or third best position, the reason is a poor acoustic score, which vitiates
the contribution of the syntactic trigrams. This suggests that incorporating this
language model prior to the generation of the N-best list could provide better
results, as the acoustic scores, the word trigrams, and the syntactic scores could
generate better hypotheses or lists of hypotheses.

Acknowledgements. This research was partially supported by NSF and
CONACyT (33002-A). The authors would also like to express their gratitude to
Lluís Padró (Universitat Politècnica de Catalunya) and Marco O. Pena (ITESM)
for their support during the configuration of the automatic tagger used in this
work.



References 

1. Rosenfeld, R., Chen, S.F., Zhu, X.: Whole-Sentence Exponential Language Models: 
A Vehicle for Linguistic-Statistical Integration. Computer Speech and Language, 
15(1), 2001. 

2. Manning, C., Schütze, H.: Foundations of Statistical Natural Language Processing,
MIT Press (2001) pp 191-255.

3. Jelinek, F.: Statistical Methods for Speech Recognition, MIT Press (1994) pp 57-78. 




N-best List Rescoring Using Syntactic Trigrams 






4. Bellegarda, J., Junqua, J., van Noord, G.: Robustness in Language and Speech
Technology, ELSNET/Kluwer Academic Publishers (2001) pp 101-121.

5. Huang, X., Acero, A., Hon, H.: Spoken Language Processing, Prentice-Hall (2001),
pp 602-610.

6. Huerta, J.M., Chen, S., Stern, R.M.: The 1998 CMU SPHINX-3 Broadcast News
Transcription System. DARPA Broadcast News Workshop, 1999.

7. Gillick, L., Cox, S.J.: Some statistical issues in the comparison of speech recognition
algorithms. In Proceedings of IEEE International Conference on Acoustics, Speech
and Signal Processing (ICASSP), pp 532-535, Glasgow, May 1992.

8. Padró, L.: A Hybrid Environment for Syntax-Semantic Tagging (Ph.D. Thesis),
Departament de Llenguatges i Sistemes Informàtics, Universitat Politècnica de
Catalunya, Barcelona, 1998.



A The EAGLES Tags 

The syntactic tags used in this work are based on the Expert Advisory Group
on Language Engineering Standards (EAGLES) initiative [8]; this text encoding
system summarizes the syntactic classification of each word in a code
of letters and numbers that labels specific attributes, such as the word’s class,
gender, number, person, case, type, etc. In Table 2 we show the EAGLES tags used
for the experiments and the particular position for gender and number (a dash
means an ignored attribute). Since our experiments tried to enforce syntactic
coherence in gender and number, the other attributes were ignored.



Table 2. EAGLES tags.

Grammatical class    Tag        Specific attributes
Adjective            A--GN-     G → Masculine, Feminine, Common.
                                N → Singular, Plural, Invariable.
Adverb               R-
Article              T-GN       G → Masculine, Feminine, Common.
                                N → Singular, Plural, Invariable.
Determiner           D--GN-     G → Masculine, Feminine, Common.
                                N → Singular, Plural, Invariable.
Noun                 N-GN       G → Masculine, Feminine, Common.
                                N → Singular, Plural, Invariable.
Verb                 V---PNG    P → 1st person, 2nd person, 3rd person.
                                N → Singular, Plural, Invariable.
                                G → Masculine, Feminine, Common.
Pronoun              P-PGN      P → 1st person, 2nd person, 3rd person.
                                G → Masculine, Feminine, Common.
                                N → Singular, Plural, Invariable.
Conjunction          CT         T → Coordinate, Subordinate.
Numeral              M-GN       G → Masculine, Feminine, Common.
                                N → Singular, Plural, Invariable.
Interjection         I
Acronyms             Y          IGNORED
Prepositions         S--GN      G → Masculine, Feminine, Common.
                                N → Singular, Plural, Invariable.
Punctuation marks    F          IGNORED
Week days            W          W
B Syntactic Trigrams Counts

The following is an example of a tagged sentence and how statistics (unigrams,
bigrams and trigrams) are accumulated by our system.

Utterance: “EL MUNDO DEL ESPECTACULO”

Syntactic tags:
(el TDMS)               → Article & Determiner & Masculine & Singular
(mundo NCMS000)         → Noun & Common & Masculine & Singular
(del SPCMS)             → Preposition & Contracted & Masculine & Singular
(espectaculo NCMS000)   → Noun & Common & Masculine & Singular

Unigrams:
Count(T-MS)++
Count(N-MS)++
Count(S--MS)++
Count(N-MS)++

Bigrams:
Count(T-MS & N-MS)++
Count(N-MS & S--MS)++
Count(S--MS & N-MS)++

Trigrams:
Count(T-MS & N-MS & S--MS)++
Count(N-MS & S--MS & N-MS)++
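
The accumulation shown in this appendix can be sketched with a few lines of Python.
The function name `count_ngrams` and the exact spelling of the reduced tags (for
instance `S--MS` for the contracted preposition) are assumptions made for the
example; the tag sequence is the one of the utterance above.

```python
from collections import Counter


def count_ngrams(tags, max_order=3):
    """Accumulate unigram, bigram and trigram counts for one tagged sentence."""
    counts = {n: Counter() for n in range(1, max_order + 1)}
    for n in range(1, max_order + 1):
        for i in range(len(tags) - n + 1):
            counts[n][tuple(tags[i:i + n])] += 1
    return counts


# Reduced tags for "EL MUNDO DEL ESPECTACULO" (gender and number kept).
tags = ["T-MS", "N-MS", "S--MS", "N-MS"]
counts = count_ngrams(tags)

for order in (1, 2, 3):
    for ngram, c in counts[order].items():
        print(" & ".join(ngram), c)
```

Over a whole training corpus the same counters would be updated sentence by
sentence. A trigram probability such as P(N-MS | N-MS, S--MS) could then be
estimated, for example, as the trigram count divided by the count of its two-tag
history; this relative-frequency estimate is an assumption, since the text only
says that the counts are used to develop probabilities.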
Abstract. Among the techniques that protect private information by means of
biometrics, speaker verification is widely used because it is natural to use
and inexpensive to implement. Speaker verification should
achieve a high degree of reliability in the verification score, flexibility in speech
text usage, and efficiency in the complexity of the verification system. Continuants
have excellent speaker-discriminant power and form a phonemic category with a
modest number of phonemes. Multilayer perceptrons (MLPs) have
superior recognition ability and fast operation speed. In consequence, the
two elements provide viable ways for a speaker verification system to obtain
the above properties: reliability, flexibility and efficiency. This paper presents the
implementation of a system to which continuants and MLPs are applied, and
evaluates the system using a Korean speech database. The results of the
evaluation show that continuants and MLPs enable the system to acquire the
three properties.

Keywords: Speaker verification, biometric authentication, continuants,
multilayer perceptrons, pattern recognition



1 Introduction 

Among acceptable biometric-based authentication technologies, speaker recognition
has many advantages due to its natural usage and low implementation cost. Speaker
recognition is a biometric recognition technique based on speech. It is classified into
two types: speaker identification and speaker verification. The former enrolls multiple
speakers in the system and selects, from the enrolled speakers, the one associated
with the given speech. By comparison, the latter takes the previously enrolled
speaker claimed by a customer and decides whether the given speech
of the customer is associated with the claimed speaker. Speaker verification is
being studied more widely and actively because it covers speaker identification
in the technical aspect [1].

For speaker verification to be influential, it is essential to have a certain degree of
each of three properties: reliability in the verification score of the implemented
system, flexibility in the usage of speech text, and efficiency in the complexity of
the verification system. First, the reliability of the verification score is the most
important of the three properties in an authentication system. An authentication
system should give a verification score as high as possible in any adverse situation.
Second, flexibility in the usage of speech text is required for users to access the
system with little effort. To resolve the overall characteristics of a speaker,
utterances covering the various positions of the vocal tract organs must be provided,
and this burdens users with long, varied, and laborious uttering [2]. Hence, it is
necessary to consider that a rather short utterance might be sufficient for achieving
a high verification score. Third, for low implementation cost, the efficiency of
system complexity has to be achieved [3]. To prevent intrusion by feigned accesses,
many speaker verification systems are implemented in text-prompted mode [4].
Although text-prompted mode gives immunity against improper accesses that record
the speech of an enrolled speaker and present it to the system, it requires a speech
recognition facility to recognize language units in a text. Complex and advanced
speech recognition increases the implementation cost of the entire system.

To satisfy the three properties of reliability, flexibility and efficiency, we
implement a speaker verification system using continuants and multilayer perceptrons
(MLPs). Continuants form a phonemic category whose voicing is continuous and
unconstrained, and they have excellent speaker-discriminant power and a small number
of phonemic classes. MLPs are artificial neural networks trained with the error
backpropagation (EBP) algorithm, and they have superior recognition ability and
fast operation speed [2], [5], [6]. Given the characteristics of continuants and the
advantages of MLPs, the system is expected to have the three properties and achieve
high speaker verification performance.

The composition of this paper hereafter is as follows. In Section 2 the feasibilities 
of the two ingredients of the proposed system are presented and we describe the 
implementation details of our speaker verification system in Section 3. The 
performance of the system is evaluated in Section 4. The paper is finally summarized 
in Section 5. 



2 Continuants and MLPs for Speaker Verification 

It is essential for a speaker verification system to accomplish a highly reliable
verification score, flexible usage, and efficient system implementation. Continuants
can realize the three properties in a speaker verification system, and MLPs help the
system acquire more reliability and efficiency. In this section, we briefly discuss
the feasibility of using continuants and MLPs for a speaker verification system.

There are many advantages of using continuants as a major language unit in
speaker verification. Speaker verification can be understood in terms of a vocal
tract model adopting multiple lossless tubes [7]. In view of this model, modeling the
speech signal based on language information is necessary because the intra-speaker
variation is bigger than the inter-speaker variation, i.e. the speaker information
from the inter-speaker variation is apt to be overwhelmed by the language
information from the intra-speaker variation [2]. The three properties of a speaker
verification system are determined by the type of language unit used. Of the various
language units, phonemes can efficiently provide reliability and flexibility.
Phonemes are atomic language units. All words are composed of phonemes, and the
characteristics of different speakers can be finely discriminated within a phoneme.
However, the capability of speaker verification varies across phonemic categories,
mainly due to their steadiness and duration of voicing. Eatock et al. and Delacretaz
et al. have studied such differences in verification capability, and their works are
summarized in Fig. 1 [5], [8]. Continuants feature continuous and unconstrained
voicing and include the best elements in Fig. 1, nasals and vowels. They show better
verification capability than any other phoneme category and therefore enhance the
reliability. Continuants can be easily detected by a speech recognition facility
because of their long voicing and the small number of kinds. As a result, continuants
can largely enhance the implementation efficiency of the speech recognition facility,
as well as the flexibility of composing any verification words with higher
verification reliability.




Fig. 1. Speaker identification error rates for the various phonemic categories reported in 
Eatock et al. and Delacretaz et al. 

MLPs are a nonparametric classification method suited to speaker verification because
of their higher recognition rate and faster recognition speed compared with the
parametric methods of existing systems. MLPs learn decision boundaries that
discriminate optimally between models. For speaker verification, MLPs have two models
to classify, i.e. the enrolling speaker and the background speakers. Such MLPs, which
have only two models to learn, are similar in effect to the cohort speakers method
developed for existing parametric speaker verification, in which the cohorts consist
of the background speakers closest to an enrolling speaker [9]. However, the cohort
speakers method, being based on probability density functions, might produce a false
recognition result depending on the distribution densities of the enrolling speaker
and the background speakers. That is, if the density of the enrolling speaker is lower
and its variance higher than those of every background speaker, then a speaker
different from the enrolled speaker might be accepted even though that speaker is far
from the enrolled speaker. MLPs, on the other hand, avoid this problem because they
discriminate the two models on the basis of their discriminative decision boundary.
Figure 2 demonstrates such a situation, in which the enrolled speaker is male and the
customer is female, and compares the cohort speakers method with MLPs. In addition,
MLPs achieve a superior verification error rate since they need not assume any
probability distribution for the underlying data [6]. Finally, the faster recognition
speed of MLPs can be attributed to the fact that all the background speakers are merged
into one model. The merging frees MLPs from calculating likelihoods for each background
speaker when verifying identities [9].

3 Implemented System

In this paper we implement a speaker verification system based on continuants and
MLPs. Because this system is based on continuants, which form a small phoneme set, it
can be adapted easily to any of the text modes, such as the text-dependent,
text-independent and text-prompted modes [10]. However, for ease of implementation this
system adopts the text-dependent mode, in which the enrolling text must be the same as
the verifying text. The speaker verification system extracts isolated words from input
utterances, classifies the isolated words into a stream of nine Korean continuants
(/a/, /e/, /a/, /o/, /u/, /!/, /i/, /!/, nasals), learns an enrolling speaker using
MLPs for each continuant, and calculates identity scores of customers. The procedures
performed in this system are described in the following:



(1) Analysis and Feature Extraction [11]

The input utterance, sampled at 16 kHz with 16-bit resolution, is divided into
overlapping 30 ms frames taken every 10 ms. Sixteen Mel-scaled filter bank coefficients
are extracted from each frame and are used to detect isolated words and continuants.
To remove the effect of utterance loudness from the entire spectrum envelope, the
average of the coefficients from 0 to 1 kHz is subtracted from all the coefficients,
and the coefficients are then adjusted so that their overall average is zero. Fifty
Mel-scaled filter bank coefficients, scaled linearly in the range from 0 to 3 kHz, are
extracted from each frame and are used for speaker verification. This scaling follows
the study arguing that more information about speakers is concentrated in the second
formant than in the first [12]. The same loudness-removal process used for detecting
isolated words and continuants is applied here as well.
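
The framing and loudness-removal steps can be sketched as follows. This is only a minimal illustration under stated assumptions, not the implementation used in the paper: the frame shift is assumed to be 10 ms, the function names are hypothetical, and `low_band_count` (the number of filter bank coefficients lying between 0 and 1 kHz) is a hypothetical parameter. The filter bank itself is not shown.

```python
def split_into_frames(samples, sample_rate=16000, frame_ms=30, shift_ms=10):
    """Cut a signal into 30 ms frames, starting a new frame every 10 ms (assumed shift)."""
    frame_len = sample_rate * frame_ms // 1000   # 480 samples at 16 kHz
    shift = sample_rate * shift_ms // 1000       # 160 samples at 16 kHz
    return [samples[start:start + frame_len]
            for start in range(0, len(samples) - frame_len + 1, shift)]


def remove_loudness(coefficients, low_band_count):
    """Subtract the 0-1 kHz average, then shift the vector so its mean is zero."""
    low_mean = sum(coefficients[:low_band_count]) / low_band_count
    shifted = [c - low_mean for c in coefficients]
    overall_mean = sum(shifted) / len(shifted)
    return [c - overall_mean for c in shifted]


# Illustrative input: one second of silence and a made-up coefficient vector.
frames = split_into_frames([0.0] * 16000)
centred = remove_loudness([1.0, 2.0, 3.0, 6.0], low_band_count=2)
```

With a 10 ms shift, one second of speech yields 98 full frames, which is consistent with counting utterance duration in 10 ms frame units.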



(2) Detecting Isolated Words and Continuants 

Isolated words and continuants are detected using an MLP learned to detect all the 
continuants and silence in speaker-independent mode. 



(3) Learning MLPs with Enrolling Speaker for Each Continuant 

For each continuant, the frames detected from the isolated words are input to the
corresponding MLP, and the MLP learns the enrolling speaker together with the
background speakers.



(4) Evaluating Speaker Score for Each Continuant 

For each continuant, all the frames detected from the isolated words are input to the 
corresponding MLP. All the outputs of the MLPs are averaged. 



(5) Comparing Speaker Score with Threshold 

The final accept/reject decision is made by comparing the average from step (4) with
a predefined threshold.
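
Steps (4) and (5) can be sketched as below. The sketch assumes that the outputs of all MLPs are pooled over every frame before averaging (the text does not say whether a per-continuant average is taken first) and that a score at or above the threshold means acceptance. The function names and the example values are hypothetical.

```python
def speaker_score(outputs_per_continuant):
    """Average the MLP outputs over all frames of all continuants (step 4)."""
    pooled = [out for outputs in outputs_per_continuant.values() for out in outputs]
    return sum(pooled) / len(pooled)


def accept(outputs_per_continuant, threshold):
    """Compare the averaged score with a predefined threshold (step 5)."""
    return speaker_score(outputs_per_continuant) >= threshold


# Made-up MLP outputs for two continuants of one verifying utterance.
outputs = {"a": [0.8, 0.6], "nasal": [0.4]}
score = speaker_score(outputs)
```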

Since this speaker verification system uses continuants as its speaker recognition
units, the underlying densities show mono-modal distributions [2]. It is therefore
sufficient for each MLP to have a two-layer structure with one hidden layer [5],
[13]. Since the MLPs have only two models to learn, the enrolling speaker and the
background speakers, they can learn the models using only one output node and two
hidden nodes. Nine MLPs in total are provided, one for each of the nine continuants.
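
The forward pass of one such MLP (50 inputs, two hidden nodes, one output node) might look like the sketch below. The tanh activation is an assumption, chosen because the learning targets are symmetric around zero; the paper does not name the activation function. The weight layout and all weight values are hypothetical.

```python
import math


def mlp_output(x, hidden_weights, hidden_biases, output_weights, output_bias):
    """Forward pass of a 50-2-1 MLP with tanh units (activation is an assumption)."""
    hidden = [math.tanh(sum(w * xi for w, xi in zip(ws, x)) + b)
              for ws, b in zip(hidden_weights, hidden_biases)]
    return math.tanh(sum(w * h for w, h in zip(output_weights, hidden)) + output_bias)


# Made-up weights: two hidden nodes over 50 filter bank coefficients.
hidden_weights = [[0.01] * 50, [-0.01] * 50]
hidden_biases = [0.0, 0.0]
output_weights = [1.0, -1.0]

frame_score = mlp_output([1.0] * 50, hidden_weights, hidden_biases, output_weights, 0.0)
silent_score = mlp_output([0.0] * 50, hidden_weights, hidden_biases, output_weights, 0.0)
```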



4 Performance Evaluation 

To evaluate the performance of the implemented speaker verification system, an 
experiment is conducted using a Korean speech database. This section records the 
results of the evaluation. 



4.1 Speech Database 

The speech data used in this experiment are recordings of connected four-digit strings
spoken by 40 Korean male and female speakers. The digits are the Arabic numerals,
pronounced in Korean as /goN/, /il/, /i/, /sam/, /sa/, /o/, /yug/, /cil/, /pal/ and /gu/.
Each speaker utters a total of 35 words of different digit strings four times, and the
utterances are recorded at 16 kHz sampling with 16-bit resolution. Three of the four
utterances are used to enroll speakers, and the last utterance is used for
verification. As background speakers, against which the MLPs learn the enrolling
speakers discriminatively, 29 Korean male and female speakers other than the above 40
speakers participate.



4.2 Experiment Condition 

In this evaluation, MLP learning to enroll a speaker is set up as follows [6] : 

• MLPs are trained by the online-mode EBP algorithm.

• Input patterns are normalized such that the elements of each pattern fall within the
range from -1.0 to +1.0.

• The objectives of the output node, i.e. the learning targets, are +0.9 for an
enrolling speaker and -0.9 for background speakers, to obtain faster EBP learning speed.

• Speech patterns of the two models are presented in an alternating manner during
learning. In most cases, the numbers of patterns for the two models are not the
same. Therefore, the patterns of the model having fewer patterns are presented
repeatedly until all the patterns of the model having more patterns have been
presented once, which completes one learning epoch.

• Since learning might fall into a local minimum, the maximum number of
learning epochs is limited to 1000.
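
The alternating presentation of the two models within one learning epoch can be sketched as follows. The order within each pair (enrolling speaker first) and the function name are assumptions made for illustration; the pattern labels are placeholders.

```python
def epoch_order(enrolling_patterns, background_patterns):
    """Yield (pattern, target) pairs for one epoch, alternating between the two models.

    The smaller set is cycled until every pattern of the larger set has been
    presented once. Both sets are assumed to be non-empty.
    """
    rounds = max(len(enrolling_patterns), len(background_patterns))
    for i in range(rounds):
        yield enrolling_patterns[i % len(enrolling_patterns)], +0.9
        yield background_patterns[i % len(background_patterns)], -0.9


# One enrolling pattern against three background patterns (placeholder labels).
order = list(epoch_order(["e1"], ["b1", "b2", "b3"]))
```

Here the single enrolling pattern is repeated three times so that each background pattern is presented exactly once in the epoch.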

Each of the 40 speakers is regarded as both an enrolling speaker and a true test
speaker, and when one of them is picked as the true speaker the other 39 speakers are
used as imposters. As a result, for each test speaker 35 tests are performed for the
true speaker and 1,365 tests (39 x 35) for imposters. In total, 1,400 true-speaker
trials and 54,600 imposter trials are performed in the experiment.
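
The total trial counts follow directly from the database layout, as this short check shows (the variable names are illustrative only):

```python
speakers = 40
words_per_speaker = 35

# Every speaker is tested once per word as the true speaker.
true_trials = speakers * words_per_speaker
# Every speaker is also tested against each of the other 39 enrolled speakers.
imposter_trials = speakers * (speakers - 1) * words_per_speaker
```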

The experiment is conducted on a 1 GHz personal computer. In the experiment results,
the error rate denotes the equal error rate (EER), the number of learning epochs
denotes the average number of epochs used to enroll a speaker for a digit-string word,
and the learning duration denotes the overall time taken to learn these patterns. The
values of the error rate, the number of learning epochs, and the learning duration are
averages over three learning runs under the same MLP learning condition, to compensate
for the effect of the randomly selected initial weights.
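
For reference, an EER can be estimated from a set of true-speaker scores and imposter scores roughly as follows. This is a generic sketch, not the evaluation code of the paper: it scans the observed scores as candidate thresholds and reports the point where the false rejection and false acceptance rates are closest. The scores in the example are made up.

```python
def equal_error_rate(true_scores, imposter_scores):
    """Return (eer, threshold) where false rejection and false acceptance are closest."""
    best = None
    for threshold in sorted(set(true_scores) | set(imposter_scores)):
        frr = sum(s < threshold for s in true_scores) / len(true_scores)
        far = sum(s >= threshold for s in imposter_scores) / len(imposter_scores)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, threshold)
    return best[1], best[2]


# Made-up scores: higher means "more like the enrolled speaker".
eer, eer_threshold = equal_error_rate([0.9, 0.8, 0.7, 0.2], [0.1, 0.3, 0.4, 0.75])
```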



4.3 Evaluation Results 

The experiment to evaluate the performance of the proposed system consists of two 
stages. First, the overall performance for the experimental speech database is 
measured. To do so, the parameters involved to train MLPs are searched to record 
error rates and the numbers of learning epochs when the best learning is achieved. 
Then, the learning records are analyzed for the advantages from the use of MLPs and 
continuants discussed in Section 2. To argue the merits of MLPs over the cohort 
speaker method, the EERs are divided into ones for the same sex and for the different 
sex. To analyze the advantages of continuants, the EERs according to the number of 
the extracted continuants and the number of frames in the extracted continuants for 
each digit string are measured. 

Fig. 3. The performance points of the system with different learning rates 



The learning parameters in MLP learning with the EBP algorithm include the learning
rate and the learning objective error energy [6]. The learning rate is the parameter
that adjusts the step length for updating the internal weight vector of an MLP. A
learning rate that is too small or too large tends to prolong the learning duration
and to increase the error rate: a small value increases the number of learning epochs
needed to reach the objective, and a large value makes learning oscillate around the
optimal learning objective. Error energy gauges the difference between the desired
output vector and the current output vector of an MLP, and the learning objective
error energy is the value that the MLP must reach for the given learning data.
Although the error rate should decrease as a lower learning objective error energy is
chosen, the number of learning epochs increases as well, and the error rate can even
get worse when the number of epochs becomes large. Consequently, a proper learning
objective error energy and learning rate must be determined for MLP learning with the
EBP algorithm.
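
To make the role of the two parameters concrete, the sketch below performs one online EBP update for a small MLP with one output node and reports the error energy of the presented pattern. It assumes tanh units and the common definition E = (d - y)^2 / 2 for a single output; neither is stated explicitly in the paper, and the network size, weights and pattern are made up.

```python
import math


def forward(x, net):
    """Return the hidden activations and the output of a one-output tanh MLP."""
    hidden = [math.tanh(sum(w * xi for w, xi in zip(ws, x)) + b)
              for ws, b in zip(net["w_hidden"], net["b_hidden"])]
    y = math.tanh(sum(w * h for w, h in zip(net["w_out"], hidden)) + net["b_out"])
    return hidden, y


def ebp_step(x, target, net, learning_rate):
    """Apply one online update in place; return the error energy before the update."""
    hidden, y = forward(x, net)
    delta_out = (target - y) * (1.0 - y * y)          # tanh'(a) = 1 - tanh(a)^2
    delta_hidden = [delta_out * w * (1.0 - h * h)     # uses the pre-update output weights
                    for w, h in zip(net["w_out"], hidden)]
    for j, h in enumerate(hidden):
        net["w_out"][j] += learning_rate * delta_out * h
    net["b_out"] += learning_rate * delta_out
    for j, ws in enumerate(net["w_hidden"]):
        for i, xi in enumerate(x):
            ws[i] += learning_rate * delta_hidden[j] * xi
        net["b_hidden"][j] += learning_rate * delta_hidden[j]
    return 0.5 * (target - y) ** 2


# Made-up two-input network and one enrolling-speaker pattern (target +0.9).
net = {"w_hidden": [[0.1, -0.2], [0.3, 0.1]], "b_hidden": [0.0, 0.0],
       "w_out": [0.2, -0.1], "b_out": 0.0}
energy_before = ebp_step([0.5, -0.5], 0.9, net, learning_rate=0.5)
energy_after = ebp_step([0.5, -0.5], 0.9, net, learning_rate=0.5)
```

In a full training loop, such updates would be repeated over the epoch ordering until the error energy falls below the learning objective or the epoch limit is reached.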

The change in performance of the implemented system for various learning rates of the
EBP algorithm is depicted in Fig. 3. The values in the figure trace the number of
learning epochs and the verification error when the learning objective error energy
is fixed at 0.01. As seen in the figure, the best learning is achieved at a learning
rate of 0.5, where the number of learning epochs is 172.3 and the error rate is 1.65 %.

The change in performance of the implemented system for various learning objective
error energies is depicted in Fig. 4. The values in the figure trace the number of
learning epochs and the verification error when the learning rate is fixed at 0.5, as
determined from Fig. 3. As seen in the figure, the optimal learning is achieved at a
learning objective error energy of 0.005, where the number of learning epochs is 301.5
and the error rate is 1.59 %. The best performance achieved is summarized in Table 1.

To demonstrate the suitability of MLPs for speaker verification, the experimental
results are analyzed for reliability and efficiency. The high reliability of MLPs is
derived from their discriminative decision boundary, which protects the verification
system from errors caused by the difference in distribution densities between the two
types of speakers: enrolling and background speakers. To verify this, the experimental
speaker set is rearranged to examine two different conditions: one in which the sex of
the enrolling speakers is the same as that of the verifying speakers, and one in which
it is different. The results of the rearrangement are presented in Table 2: compared
with the EER for the entire database, the EER is higher for the same sex and lower for
different sexes.

Table 1. The best performance of the implemented system

EER (%)   Number of Epochs   Enrolling Duration (sec)   Verifying Duration (millisec)
1.59      301.5              2.7                        0.86

Table 2. Comparison of the error rates for the same sexes and for different sexes at
enrolling and verifying speakers

          Entire Database   Same Sex   Different Sex
EER (%)   1.59              2.29       0.78

Fig. 4. The performance points of the system with different learning objective error energies

Fig. 5. Experiment result analyses: (a) Distribution of error rates according to the numbers of
the extracted continuants for each verifying digit string; (b) Distribution of error rates according
to the numbers of frames in the extracted continuants for each verifying digit string

To illustrate the superior properties of continuants for speaker verification, the
experiment results are analyzed for reliability in error rate, flexibility in utterance
duration and efficiency in the number of recognition units. Panels (a) and (b) of
Fig. 5 show the distribution of error rates according to the number of extracted
continuants for each verifying digit string and according to the number of frames in
the extracted continuants, respectively. Panel (a) shows that the error rate decreases
nearly linearly as the number of continuants increases, and that an error rate below
1 % can be obtained when more than 7.5 of the 9 continuants are included in the
verifying utterance. Panel (b) shows that an utterance duration of only 2 to 4 seconds,
given a unit frame duration of 10 ms, is sufficient to achieve a fairly low error rate,
although it should be borne in mind that the verifying utterance includes consonants
and other phonemes besides continuants. The results in Fig. 5 indicate that, once the
duration exceeds a certain level, it is more important to include many different
continuants than to prolong the utterance of the same continuant. Finally, the speech
recognition module that identifies continuants is simplified by the fact that at most
9 continuants need to be recognized.



5 Conclusion 

The measurements and analyses of the experiment brought out the credibility,
elasticity and practicality that continuants and MLPs can offer in speaker
verification. To be attractive, speaker verification should achieve a high degree of
credibility in verification score, elasticity in speech text usage, and practicality
in verification system complexity. Continuants have excellent speaker-discriminant
power and a small number of classes, and multilayer perceptrons have high recognition
ability and fast operation speed. Consequently, the two provide a feasible means for a
speaker verification system to obtain the above three properties. In this paper we
implemented a speaker verification system to which continuants and MLPs were applied,
and measured and analyzed its performance using a Korean database of continuously
spoken four-digit strings. The results of the experiment confirm that continuants are
very effective in achieving all three properties and that MLPs enable the system to
acquire more credibility and practicality.

Abstract. Agent protocols are difficult to specify, implement, share and
reuse. In addition, there are no well-developed tools or methodologies
for doing so. Current efforts are focused on creating protocol languages with
which it is possible to have formal diagrammatic representations of protocols.
What we propose is a framework that uses ontology technology to
specify protocols in order to make them shareable and reusable.



1 Introduction 

Agent protocols are very complex and have different elements which need to
be carefully combined and well understood in order to make them useful. They
need to be unambiguous and readable by both humans and agents. This is not
an easy task because even the simplest agent protocols can be very complex and
difficult to implement. In addition, there are no well-developed tools to make it
easy. For instance, in [6] the author says that "Perhaps the major challenge from
a technical point of view was the design of the protocols...", and this is because
even in big implementations the protocols are built from scratch and there are
no shared and reusable libraries or methodologies for creating a protocol for a
specific purpose. If there are some, it may not be worth trying to use them
because, at the current stage of protocol development, it is less risky and
easier to create the desired protocol from scratch.

Protocols are complex because they have several components that need to 
work well and smoothly together. For instance, unknown agents aiming to have 
a conversation need to agree about the language to use and its semantics. In 
addition, they need to share the knowledge needed to use that language in a 
specific matter or ontology. If we are talking about non centralised protocols, the 
coherence of the conversations relies on the ability of every agent to understand 
and interpret properly all the protocol components. 

What we propose here is a framework that uses ontology technology to specify 
protocols. What we do is to create the ontologies with an ontology editor (e.g. 
Protege [8]) and convert them to the Resource Description Framework (RDF) 
model. In this way it is possible to generate protocol instantiations from reusable 
specifications. The challenge here is to create a framework which supports lan- 
guages sufficiently expressive to specify protocols and their formal diagrammatic 
specification. In this paper we address the following issues: 

1. Is it possible to keep in an ontology all the features that protocol languages
provide? 

2. What do we gain from using this approach to specify the protocols? 

2 Background

An Agent Communication Protocol aimed to be a communication channel be- 
tween unknown, autonomous and heterogeneous agents would ideally have most 
of the following characteristics: 

— It needs to be specified in such a way that there is no room for ambiguities 
which lead the developers to different interpretations of the protocol. 

— The specification language needs to be expressive enough to state all the 
required features in a protocol of this kind. 

— The specification needs to be as easy as possible to read/understand and 
able to be run in different computer languages and platforms. That means 
that the way to specify the protocol needs to be a standard or a widely used 
language. 

— The specification needs to be easy to maintain, share and reuse. 

— The specification needs to support concurrency.

Protocols in agent communication have been specified in many different ways. We
can find specifications based on the combination of natural language and AUML¹
diagrams [4]. This is a very general specification of the type of messages (i.e. the
performative to use) and the sequence that agents should follow to exchange
them in a context with a high level of generality. For complex conversations
agents need more information than that, for instance the possible reason
why the other participants or they themselves are reacting in a particular way. If an
agent is aiming to have conversations with unknown and heterogeneous agents,
it needs to be able to resolve aspects of the protocol that are not well defined in
this kind of specification. The developers are then forced to interpret the lack
of information (or ambiguity) in their own way to equip their agents with the
complete machinery needed to run the protocol [1].

There is work going on to tackle this issue. Much of the recent effort focuses 
on techniques that allow protocols to be specified in two different ways. The 
first one is a formal language with enough expressiveness to create unambiguous 
protocols which do not leave room for different interpretations. The second is 
to find a diagrammatic representation with which it is possible to represent a 
protocol without losing expressiveness. In [7] the authors propose the ANML 
protocol language and a diagrammatic representation which combines an exten- 
sion of statechart notation and some ANML features. In [9] the author describes 
a formal method using rewrite rules to specify asynchronous and concurrent 
agent executable dialogues. This description allows the specification of formally 
concurrent protocols. 

Although these approaches present executable descriptions of protocols, the 
problem of interoperability is not solved. In addition, we would like to have 
a mechanism to specify protocols which has features from both languages, for 
instance, concurrency and diagrammatic representation. 

¹ Agent-based Unified Modelling Language

3 Protocols as Ontologies

If we take a well-known and general definition of what an ontology is, we may
say that it is an explicit specification of a conceptualisation [5]. From this view
it would be possible to specify any concept. A protocol is a concept, therefore
it can be specified in this way. However, ontologies have traditionally been used
for specifying objects, not processes or exchanges of messages. A major difference
between a protocol and an object is that the protocol needs to be executable.
That is, a protocol is a specification that can lead an agent to perform some
actions and to put itself in a specific position within a group of agents. When we
talk about heterogeneous computer programs (i.e. software agents) which have
some degree of autonomy and may or may not have interacted before, whether to
collaborate or perhaps to compete, that is not a trivial issue.

What we propose here is to specify the protocols using ontology technology. 
One advantage of this approach is that agent technology is dealing with the 
issue of how agents should understand, share and handle ontologies and these 
efforts are being done separately from agent protocols. We can assume that 
agents will have to extend their capabilities of ontology management to be able 
to cope with these executable ontologies. In fact, we see agent protocols as a 
special case of ontologies. 

4 Architecture 

In order to specify the protocol as an ontology it is necessary to identify which 
are the main elements that are always present in any protocol. Doing this we 
are able to make a generic ontology which can be used as a template to create 
instantiations from it. The main elements that we can find in protocols are: 

Roles are assigned to agents. They define the set of states and transitions that
the agent is allowed to pass through and the actions that the agent is allowed,
and perhaps obligated, to perform.

States may be explicitly or implicitly stated. Every agent is always in a
particular state related to the role it is playing. States may be specified in a
hierarchical way which specifies that X ⊂ Y where X and Y are states.

Transitions are links between states and are used by the agents to go from
one state to another. They need to be triggered by preconditions. They are
often related to processes that should be executed when the transition has
been triggered. They have a direction that specifies the origin state and the
target state.

Actions are specific processes that the agents need to execute at some point in
the protocol. They may be executed within a state or a transition.

Operators relate transitions and actions to lead the agents to the next step to
perform.

Messages are the units of information that agents send and receive among
themselves. The agents may send messages during a process within a state or
during a transition. Messages may be received at any time.

Decision points are places where agents need to make decisions in order to
carry on with the protocol. They involve operators, transitions and actions.
This point is central to the ambiguity problem since some protocol specifications
do not say how the agents may/should take decisions.

4.1 The High Level Set of Ontologies 

In this section we describe in a top-down way the high level architecture of the
different ontologies we put together to get a template from which we can create
instantiations. The whole architecture is based on ontologies. That is, all the
elements described are ontologies or properties of an ontology. An element of an
ontology may be an ontology or an atomic value. We describe ontologies as tuples
of the form $o = \{e_1, \ldots, e_n\}$ where $n$ is the number of elements in $o$. We
describe the structure of an ontology as a record type
$Ontology = \{A_1 : O_1, \ldots, A_m : O_m\}$ with $m$ fields named $A_1, \ldots, A_m$;
each field type may be an ontology or an atomic type AtomicType that is defined as

Ontology.AtomicType ::= integer | float | symbol | uri | ip | boolean | ...

where the left hand side of the dot represents the ontology and the right hand
side the type. It is possible to have a wide range of atomic types. We assume
that there are ontologies that describe lists and sets where the elements may be
atomic values or ontologies. The ontology Protocol is defined as

Protocol = {Name, URI, Roles, States, Processes, Transitions, Messages}

where Name is a symbol that identifies a particular kind of protocol, for instance
Name = english_auction. This information helps interoperability since these
symbols may become a standard in the agent community. If a new type of protocol
is created, it is an easy task to create a new symbol for it. URI is a
Uniform Resource Identifier [11] which helps to identify a specific instantiation of
a particular protocol. For example, URI = whereagentsare.org:26000/ABC123
describes a particular machine and port to which the agents need to send their
requests to participate in the protocol ABC123. This protocol may be held just for a
few hours or days, and once it has finished that URI will no longer be valid for
participating in any other protocol. Instead, the URI may be used to retrieve the
history of the protocol (e.g. the identity of the participants and the result of the
protocol) as necessary.

Roles is an ontology which specifies all roles that may be assumed by the
agents and those agents who can assume those roles. We say that

Roles = {RName, SetOfAgents, SetOfStates}

where RName is a symbol and SetOfAgents is an ontology that describes the
agents allowed to assume the role and how many agents can assume the role at
the same time. SetOfAgents is described as

SetOfAgents = {Agents, MaxNumber}

where Agents is an ontology that describes a set of agents. Every agent is
described as

Agent = {AName, IP, Port}

where AName is a unique² symbol, IP is an Internet Protocol address and
Port is an integer that specifies the port where the agent can be reached at
that IP. This information allows identification of a particular agent on the
Internet. This implies that the same agent may reside at another IP/port at the
same moment of performing the protocol, allowing the agents to be mobile and
ubiquitous. MaxNumber is the maximum number of agents that can assume
this role at once. If MaxNumber = -1 the maximum number is not defined
and there is no limit.
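
As an illustration of how the Agent and SetOfAgents records could be held in a program, the sketch below mirrors them as Python dataclasses. It is not part of the framework: the class and method names are hypothetical, the agents and addresses are made up, and treating an empty agent set as "any agent may assume the role" is an assumption.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Agent:
    a_name: str   # AName: unique symbol
    ip: str       # IP address
    port: int     # port where the agent can be reached


@dataclass
class SetOfAgents:
    agents: List[Agent] = field(default_factory=list)
    max_number: int = -1   # -1 means that no limit is defined

    def may_assume(self, agent, currently_enacting):
        """Check whether `agent` may take the role given how many agents already hold it."""
        allowed = not self.agents or agent in self.agents   # assumption: empty set = anyone
        has_room = self.max_number == -1 or currently_enacting < self.max_number
        return allowed and has_room


# Made-up agents and role constraints.
auctioneer = Agent("auctioneer1", "192.0.2.1", 26000)
bidder = Agent("bidder7", "192.0.2.7", 26001)
single_role = SetOfAgents([auctioneer], max_number=1)
open_role = SetOfAgents([], max_number=-1)
```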

SetOfStates is an ontology describing the set of states that a particular role
is allowed to pass through in the protocol. Every state is specified as

States = {SName, Ancestor, TransIn, TransOut, TransFormulae}

where SName is a symbol, Ancestor is defined as

Ancestor ::= SName | nil

which allows a hierarchical structure of states that can have a root
ancestor. TransIn and TransOut are sets of transitions which allow agents to
enter and leave the state. These are defined as:

Transitions = {TName, StateFrom, StateTo, ProcFormulae, Parameters}

where TName is a symbol, and StateFrom and StateTo are sets of state names
which specify the states that are linked by the transition. ProcFormulae is a
formula that defines the way the processes that are involved in the transition
should be executed. When the formula is evaluated, there is a boolean output.
Parameters defines the parameters that should be passed to every process
involved in ProcFormulae. Before explaining how these two elements work, we
describe TransFormulae, which is an element of the States ontology and is the
upper level formula.

TransFormulae is the decision point in every state. When agents enter a 
state, they should evaluate this formula in order to know the next step to do. 
That is, which transition(s) they should follow. The decision point is not deter- 
ministic since the evaluation’s result may vary depending on the agents’ beliefs 
and capabilities. 
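
The Ancestor field of the States ontology induces a hierarchy of states. One way an agent could test whether a state lies inside another is sketched below; the dictionary representation, the function name and the state names are hypothetical and serve only as an illustration.

```python
def is_within(ancestors, state, candidate):
    """Return True if `candidate` is `state` itself or one of its ancestors.

    `ancestors` maps each SName to its Ancestor, with None standing for nil.
    """
    seen = set()
    current = state
    while current is not None and current not in seen:
        if current == candidate:
            return True
        seen.add(current)              # guards against an accidental cycle
        current = ancestors.get(current)
    return False


# Made-up hierarchy: "bidding" is a sub-state of "auctionOpen".
ancestors = {"auctionOpen": None, "bidding": "auctionOpen", "auctionClosed": None}
inside = is_within(ancestors, "bidding", "auctionOpen")
outside = is_within(ancestors, "auctionClosed", "auctionOpen")
```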

The transition operators we describe here are: ∧, which describes an or in
the sense that if Ta is true in Ta ∧ Tb then Tb is not evaluated; ∨, which describes
an or with the meaning that both operands will be evaluated and will return
true if either of them is true; the operator par, which works as described in [9],
where both transitions will be evaluated concurrently; the operator ∧∘, which has the
syntax A ∧∘ B [9], where if A is true, B will be evaluated at some point in the future;
and the operator ∨par, which evaluates both operands concurrently and, when one of
them is true, stops evaluating the other. TransFormulae is defined as

TransFormulae ::= boolean | nil | TOp '[' FP FP ']'
TOp ::= ∧ | ∨ | par | ∧∘ | ∨par
FP ::= TransFormulae | PName

In Fig. 1 we present a diagrammatic representation of the transition operators
as defined above. The little dots at the end of the lines between transitions show
the precedence in which each transition should be evaluated.

² The issue of how to make these symbols unique is outside the scope of this paper
but is very relevant for the interoperability problem.




∧     ∨     par     ∧∘     ∨par



Fig. 1. Transition Operators 

PName is a process name explained below. The processes are defined as

Process = {PName, ParamStructure, Output, Description, SampleCode}

where PName is a symbol and ParamStructure is an ontology describing the
process' parameters (e.g. order and type). Output ::= Ontology.AtomicType.
Description is a brief explanation of the functionality of the process stated in
natural language and SampleCode is a small piece of programming language code that
illustrates the process. These last two properties are omitted in this paper for
reasons of space. We now come back to ProcFormulae, which is a formula that
specifies how processes are evaluated and is defined as

ProcFormulae ::= nil | boolean | Comparison | LogicOperator '[' PC PC ']'
PC ::= ProcFormulae | Comparison
LogicOperator ::= ∨ | ∧
Comparison ::= ComparisonOp '[' PAT PAT ']'
PAT ::= PName | Ontology.AtomicType
ComparisonOp ::= = | ≠ | < | >

where we have two logical operators and four comparison operators. Every time
a process is present in ProcFormulae, it has a set of parameters specified in
Parameters. If a process is specified more than once, the order of the sets of
parameters matches the order of precedence in ProcFormulae.

Parameters ::= '[' (PName '[' ParamList* ']')* ']'
ParamList ::= Ontology.AtomicType
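
To illustrate how a ProcFormulae together with its Parameters might be evaluated, the following sketch represents a formula as nested tuples in prefix form. It is a hypothetical illustration rather than part of the framework: ASCII spellings ("and", "or", "=", "!=", "<", ">") stand in for the logical and comparison symbols, nil is represented by None and is assumed to evaluate to true, both operands of a logic operator are always evaluated so that parameter sets are consumed in order, and the process names and values are made up.

```python
import operator

COMPARISONS = {"=": operator.eq, "!=": operator.ne, "<": operator.lt, ">": operator.gt}


def evaluate_proc_formula(formula, processes, parameters):
    """Evaluate a prefix-form formula; `parameters` maps PName -> list of argument lists."""
    queues = {name: iter(arg_lists) for name, arg_lists in parameters.items()}

    def operand(pat):
        # A PAT is either the name of a process to run or an atomic value.
        if isinstance(pat, str) and pat in processes:
            return processes[pat](*next(queues[pat]))
        return pat

    def walk(node):
        if node is None:                 # nil: assumed to impose no condition
            return True
        if isinstance(node, bool):
            return node
        op, left, right = node
        if op in COMPARISONS:
            return COMPARISONS[op](operand(left), operand(right))
        left_value, right_value = walk(left), walk(right)
        if op == "and":
            return left_value and right_value
        if op == "or":
            return left_value or right_value
        raise ValueError(f"unknown operator: {op!r}")

    return walk(formula)


# Made-up processes: the current bid for an item and its reserve price.
processes = {"currentBid": lambda item: {"book": 12}[item],
             "reservePrice": lambda item: 10}
parameters = {"currentBid": [["book"]], "reservePrice": [["book"]]}
bid_accepted = evaluate_proc_formula(
    ("and", (">", "currentBid", "reservePrice"), True), processes, parameters)
bid_too_low = evaluate_proc_formula(
    ("or", ("<", "currentBid", 5), False), processes, parameters)
```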

Messages is an ontology based on the FIPA ACL³ Message Structure
Specification [3]. For simplicity, we are not using all the fields.

Messages = {Performative, Sender, Receivers, Ontology,
Protocol, Content, Language}

Performative ::= request | inform
Sender ::= Agent
Receivers ::= '[' Agent+ ']'
Ontology ::= uri
Protocol ::= uri



Content is the content of the message and is what the agent who sends the 
message wants to communicate to the receivers. Language is the language used 
to express the content. 
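The message structure above maps naturally onto a record type. The sketch below is only an illustration under stated assumptions: agents and URIs are represented as plain strings, at least one receiver is required, and the class name `Message` and the example URIs are hypothetical.

```python
from dataclasses import dataclass
from typing import List

# Performatives allowed by the simplified grammar in the text.
PERFORMATIVES = {"request", "inform"}

@dataclass
class Message:
    performative: str      # "request" or "inform"
    sender: str            # agent name (a plain string here, by assumption)
    receivers: List[str]   # agent names; assumed to be non-empty
    ontology: str          # URI
    protocol: str          # URI
    content: str           # what the sender wants to communicate
    language: str          # language used to express the content

    def __post_init__(self):
        if self.performative not in PERFORMATIVES:
            raise ValueError(f"unknown performative: {self.performative}")
        if not self.receivers:
            raise ValueError("a message needs at least one receiver")

# Example values are placeholders, not taken from the paper.
msg = Message("inform", "yp", ["node1"], "http://example.org/onto",
              "http://example.org/protocol", "accepted", "text")
```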

5 Example 

We present a simplified case for one of the protocols of Agenteel, a system that 
we are implementing and with which we are exploring the ideas presented here. 
Agenteel is a multi-agent framework whose architecture requires an agent called 
yellowPages and a set of agents called nodes. The main function of yellowPages 
is to stay online at the same location on the web. The protocol states where to 
find yellowPages, so any node capable of finding, reading and executing 
the protocol can find it easily. The issue of where to find the specification is not 
discussed in this example: we assume that the agents know where to find the 
protocol's specification. When a node wants to be incorporated in the framework, 
it contacts yellowPages, obtains the information about where the other agents 
are, and gives the information of where it can be found. In addition, 
yellowPages propagates the address of the new node to the other nodes. So our 
first definition is 

Protocol = {agenteel, whereagentsare.org:26000/001, 
            {{yellowPages, {yp, whereagentsare, 26000}, 1}, 
             {listening, attendMessage}}, 
            {{node, {}, -1}, 
             {initialising, waitingConfirmation, online, offline}}, 
            States, Processes, T, Messages} 

ACL: Agent Communication Language 
where the role yellowPages can be enacted by a single agent, which must 
be specifically the agent yp. This issue of authority in protocols is important: 
who can take what roles? In contrast, there is room for any number of nodes. 
Any agent may execute the protocol and assume the node role. The states are 
defined as 

States = {{listening, nil, {t1}, {t1, t2}, decision1}, 
          {attendMessage, nil, {t2}, nil, nil}, 
          {initialising, nil, nil, {t3}, decision2}, 
          {waitingConfirmation, nil, {t3}, {t4, t5}, decision3}, 
          {online, nil, {t4}, nil, nil}, 
          {offline, nil, {t5}, nil, nil}} 

With this information an agent can form an idea of the paths that it is possible 
to take, not just for the role it will assume but also for the other roles. In 
addition, it is possible to know where the decision points are. The transitions are 
defined as 

T = {{t1, listening, listening, =[receivedMessage true], nil}, 
     {t2, listening, attendMessage, 
      =[getMessage inform], [getMessage[performative M]]}, 
     {t3, initialising, waitingConfirmation, 
      =[sendMessage true], [sendMessage[yp M]]}, 
     {t4, waitingConfirmation, online, 
      ∧[=[getMessage inform] =[getMessage accepted]], 
      [getMessage[performative M] getMessage[content M]]}, 
     {t5, waitingConfirmation, offline, >[timeOut 10], [timeOut[T]]}} 

The transitions’ specification gives the agent the necessary information to know 
how the processes should be performed and what conditions are needed to trigger 
the transitions. 
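To show how an agent might reason over this specification, here is a small Python sketch that stores only the source and target state of each transition and derives the outgoing transitions and the reachable states. It is an illustration, not part of Agenteel; the function names `outgoing` and `reachable` are hypothetical, and the trigger formulae are left out.

```python
# Transitions from the example: name -> (source state, target state)
T = {
    "t1": ("listening", "listening"),
    "t2": ("listening", "attendMessage"),
    "t3": ("initialising", "waitingConfirmation"),
    "t4": ("waitingConfirmation", "online"),
    "t5": ("waitingConfirmation", "offline"),
}

def outgoing(state):
    """Names of the transitions that leave the given state."""
    return sorted(name for name, (src, _) in T.items() if src == state)

def reachable(start):
    """All states reachable from start by following transitions."""
    seen, stack = {start}, [start]
    while stack:
        current = stack.pop()
        for src, dst in T.values():
            if src == current and dst not in seen:
                seen.add(dst)
                stack.append(dst)
    return seen
```

A state with more than one outgoing transition, such as waitingConfirmation, is exactly where a decision point is needed.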

Processes = {{getMessage, [symbol], message.abstract}, 
             {receivedMessage, [], boolean}, 
             {sendMessage, [agent message], boolean}, 
             {timeOut, [time], boolean}} 

The processes’ specification completes the information that the agents need to 
know if they are capable of performing the processes in order to assume a specific 
role. Agents need to answer the question: am I capable of assuming this role? 
In the processes we have omitted Description and SampleCode to simplify 
the example. In ParamStructure we have defined only the parameters' types. 
message.abstract assumes that all the fields in the ontology Messages have 
been inherited from an abstract type that allows the output to be any field 
found in it. The decision points are 
decision1 = ∧◇ par[t1 t2]] 
decision2 = 

decision3 = ∨par[t4 t5] 

where there are three decision points. In decision1, yellowPages acts as a server 
listening for new incoming messages; when a message arrives, it splits the 
execution into two threads, one to go back to listening and the other to 
attend the message. In decision2 the agent does not have any choice: when it is 
initialising the protocol, it should send a message to yellowPages. In decision3 
the agent evaluates both transitions and waits until it receives a confirmation 
that it has been included in the framework. On the other hand, if the process 
timeOut reaches its limit, the agent is not included in the framework. In this 
case, to get into the framework, the agent would need to start the protocol from 
the beginning again. Figure 2 depicts the two simplified diagrammatic 
representations, one for each role, that together comprise the whole protocol. 



Fig. 2. Agenteel Protocol 
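The question "am I capable of assuming this role?" can be reduced, in a first approximation, to a set comparison between the processes a role uses and the processes an agent implements. The sketch below assumes that the processes used by each role are read off the transitions of the example; the names `ROLE_PROCESSES`, `missing_processes` and `can_assume` are hypothetical, and parameter and output types are not checked.

```python
# Processes used by each role in the example, read off its transitions
# (an assumption of this sketch; types are not compared).
ROLE_PROCESSES = {
    "yellowPages": {"receivedMessage", "getMessage"},
    "node": {"sendMessage", "getMessage", "timeOut"},
}

def missing_processes(role, capabilities):
    """Processes the role needs that the agent does not implement."""
    return sorted(ROLE_PROCESSES[role] - set(capabilities))

def can_assume(role, capabilities):
    """True if the agent implements every process the role requires."""
    return not missing_processes(role, capabilities)
```

An agent that lacks a process could use the returned list to decide which implementations it would have to acquire before joining the protocol.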



6 Current Work and Conclusions 

We are using Protégé [8] to create this kind of protocol and to convert it 
to the RDF model. This model is divided into two parts. The first is the 
structure of the ontology and the second is its instantiation. This allows 
the agents to read the instantiation of the protocol and to reason about any 
term found in it using the structure's specification. Agenteel is created using the 
Mozart Programming Language [10]. We are using an XML parser [2] that is 
capable of reading RDF specifications and parsing them into Mozart's internal 
data structures. This shows that it is possible to specify protocols using 
ontology technology and to execute them using a programming language. Agenteel 
supports protocols which are held by only one agent. That is, the agent's internal 
behaviour is controlled by internal protocols or soliloquies. 

Even though there is much more work to do, we believe that it is possible 
to specify protocols using ontologies and keep the features needed by the agent 
community. This will improve agent interoperability. What we gain by doing so 
is that, instead of tackling the problems of ontologies and protocols separately, 
we have a bigger picture of a problem that we think embraces both issues. In 
addition, we contribute to the task of creating shareable and reusable 
protocols. Reusability is a feature that software technology has been developing 
as one of its most important characteristics. As part of software improvement, it 
is necessary that agent developers evaluate the possibilities that programming 
languages offer for implementing agent protocols, which are very demanding. 
Abstract. The evaluation of both ontology content and ontology building tools 
plays an important role before using ontologies in Semantic Web applications. In 
this paper we assess the ontology evaluation functionalities of the following 
ontology platforms: OilEd, OntoEdit, Protégé-2000, and WebODE. The goal of 
this paper is to analyze whether such ontology platforms prevent the ontologist 
from making knowledge representation mistakes in concept taxonomies during 
RDF(S) and DAML+OIL ontology import, during ontology building, and during 
ontology export to RDF(S) and DAML+OIL. Our study reveals that most of 
these ontology platforms detect only a few mistakes in concept taxonomies 
when importing RDF(S) and DAML+OIL ontologies. It also reveals that most 
of these ontology platforms detect only some mistakes in concept taxonomies 
when building ontologies. Our study also reveals that these platforms do not 
detect any taxonomic mistake when exporting ontologies to such languages. 



1 Introduction 

Ontology content should be evaluated before using or reusing it in other ontologies or 
software applications. Evaluating the ontology content and the software used to 
build ontologies are important processes to take into account before integrating 
ontologies in final applications. Ontology content evaluation should be performed 
during the whole ontology life-cycle. In order to carry out such evaluation, ontology 
development tools should support content evaluation during the whole process. 

The goal of ontology evaluation is to determine what the ontology defines 
correctly, what it does not define, and what it defines incorrectly. Up to now, few 
domain-independent methodological approaches [4, 8, 11, 13] have been reported 
for building ontologies. All the aforementioned approaches identify the need for 
ontology evaluation. However, such evaluation is performed differently in each of them. 

The main efforts on ontology content evaluation were made by Gómez-Pérez [6, 7] 
and by Guarino and colleagues with the OntoClean method [9]. 

In recent years, the number of tools for building, importing, and exporting 
ontologies has increased exponentially. These tools are intended to provide support 
for the ontology development process and for the subsequent ontology usage. 
Examples of such platforms are: OilEd [2], OntoEdit [12], Protégé-2000 [10], and 
WebODE [3, 1]. 

Up to now, we do not know of any document that describes how different ontology 
platforms evaluate ontologies during the processes of import, building and export. In 
this paper we study whether the ontology platforms listed above prevent the 
ontologist from making knowledge representation mistakes in concept taxonomies. 

We have performed experiments with 24 ontologies (7 in RDF(S)¹ ² and 17 in 
DAML+OIL³) that are well built from a syntactic point of view, but that have 
inconsistencies and redundancies. These knowledge representation mistakes are not 
detected by the current RDF(S) and DAML+OIL parsers [5]. We have imported these 
ontologies into the ontology platforms listed above. We have also built 17 ontologies 
with inconsistencies and redundancies using the editors provided by these 
platforms. After that, we have exported such ontologies to RDF(S) and DAML+OIL. 

This paper is organized as follows: section two briefly describes the method for 
evaluating taxonomic knowledge in ontologies. Section three gives an overview of the 
ontology platforms used. Section four presents the results of importing, building and 
exporting RDF(S) and DAML+OIL ontologies with taxonomic mistakes in the 
ontology platforms. Finally, section five concludes with further work on evaluation. 



2 Method for Evaluating Taxonomic Knowledge in Ontologies 

Figure 1 shows a set of the possible mistakes that can be made by ontologists when 
modeling taxonomic knowledge in an ontology [6]. 



Inconsistency 
    Circularity Issues 
    Partition Errors 
        Common classes in disjoint decompositions and partitions 
        Common instances in disjoint decompositions and partitions 
        External classes in exhaustive decompositions and partitions 
        External instances in exhaustive decompositions and partitions 
    Semantic Errors 
Incompleteness 
    Incomplete Concept Classification 
    Partition Errors 
        Disjoint knowledge omission 
        Exhaustive knowledge omission 
Redundancy 
    Grammatical 
        Redundancies of subclass-of relations 
        Redundancies of instance-of relations 
    Identical formal definition of some classes 
    Identical formal definition of some instances 

Fig. 1. Types of mistakes that might be made when developing taxonomies 



In this paper we have focused only on inconsistency mistakes (circularity and 
partition) and grammatical redundancy mistakes, and have postponed the analysis of 
the others for future work. 



¹ http://www.w3.org/TR/PR-rdf-schema 
² http://www.w3.org/TR/REC-rdf-syntax/ 
³ http://www.daml.org/2001/03/daml+oil-walkthru.html 
We would like to point out that concept classifications can be defined in a disjoint 
(disjoint decompositions), a complete (exhaustive decompositions), and a disjoint and 
complete manner (partitions). 



3 Ontology Platforms 

In this section, we provide a broad overview of the tools we have used in our 
experiments: OilEd, OntoEdit, Protégé-2000, and WebODE. 

OilEd⁴ [2] was initially developed as an ontology editor for OIL ontologies, in the 
context of the IST OntoKnowledge project at the University of Manchester. However, 
OilEd has evolved and is now an editor of DAML+OIL and OWL ontologies. OilEd 
can import ontologies implemented in RDF(S), OIL, DAML+OIL, and in the SHIQ 
XML format. OilEd ontologies can be exported to DAML+OIL, RDF(S), OWL, 
the SHIQ XML format, and the DIG XML format. 

OntoEdit⁵ [12] was developed by AIFB at Karlsruhe University and is now being 
commercialized by Ontoprise. It is an extensible and flexible environment 
based on a plug-in architecture, which provides functionality to browse and edit 
ontologies. Two versions of OntoEdit are available: Free and Professional. OntoEdit 
Free can import ontologies from FLogic, RDF(S), DAML+OIL, and from directory 
structures and Excel files. OntoEdit Free can export to OXML, FLogic, RDF(S), and 
DAML+OIL. 

Protégé-2000⁶ [10] was developed by Stanford Medical Informatics (SMI) at 
Stanford University, and is the latest version of the Protégé line of tools. It is an open 
source, standalone application with an extensible architecture. The core of this 
environment is the ontology editor, and it holds a library of plug-ins that add more 
functionality to the environment (ontology language import and export, etc.). 

Protégé-2000 ontologies can be imported and exported with some of the back-ends 
provided in the standard release or provided as plug-ins: RDF(S), DAML+OIL, OWL, 
XML, XML Schema, and XMI. 

WebODE⁷ [3, 1] is an ontological engineering workbench developed by the Ontology 
Engineering Group at Universidad Politécnica de Madrid (UPM). It is an ontology 
engineering suite created with an extensible architecture. WebODE is not used as a 
standalone application but as a Web application. Three user interfaces are combined 
in the WebODE ontology editor: an HTML form-based editor for editing all ontology 
terms except axioms and rules; a graphical user interface, called OntoDesigner, for 
editing concept taxonomies and relations; and the WebODE Axiom Builder (WAB) 
[3], for creating formal axioms and rules. 

There are several services for importing and exporting ontologies: XML, RDF(S), 
DAML+OIL, OIL, OWL, XCARIN, FLogic, Jess, Prolog, and Java. 



⁴ http://oiled.man.ac.uk 
⁵ http://www.ontoprise.de/com/start_downlo.htm 
⁶ http://protege.stanford.edu/plugins.html 
⁷ http://webode.dia.fi.upm.es/ 
4 Comparative Study of Ontology Platforms 

At present, there are a great number of ontologies in RDF(S) and DAML+OIL, and 
most of the RDF(S) and DAML+OIL parsers are not able to detect taxonomic 
knowledge representation mistakes in ontologies implemented in such languages [5]. 
Therefore, we have decided to analyze whether the ontology platforms presented in 
section 3 are able to detect this type of mistake during RDF(S) and DAML+OIL 
ontology import, ontology building, and ontology export to RDF(S) and DAML+OIL. 
The results of our analysis are shown in the tables using the following symbols: 

The ontology platform detects the mistake. 

1 ^ The ontology platform allows inserting the mistake, which is only detected when 
the ontology is verified. 

The ontology platform does not detect the mistake. 

© The ontology platform does not allow representing this type of mistake. 

— The mistake cannot be represented in this language. 

© The ontology platform does not allow inserting the mistake. 



4.1 Detecting Knowledge Representation Mistakes during Ontology Import 

To carry out this experiment, we have built a testbed of 24 ontologies (7 in RDF(S) 
and 17 in DAML+OIL), each of which implements one of the possible problems 
presented in section 2. In the case of RDF(S) we have only 7 ontologies because 
partitions cannot be defined in this language. This testbed can be found at 
http://minsky.dia.fi.upm.es/odeval. We have imported these ontologies using the 
import facilities of the ontology platforms presented in section 3. The results of this 
experiment are shown in table 1. Figure 2 shows the code of two of the ontologies 
used in this study: circularity at distance 2 in RDF(S) and external instance in a 
partition in DAML+OIL. 



a) Loop at distance 2 in RDF(S) 

<rdfs:Class rdf:ID="ClassA"> 
  <rdfs:subClassOf rdf:resource="#ClassB" /> 
</rdfs:Class> 
<rdfs:Class rdf:ID="ClassB"> 
  <rdfs:subClassOf rdf:resource="#ClassC" /> 
</rdfs:Class> 
<rdfs:Class rdf:ID="ClassC"> 
  <rdfs:subClassOf rdf:resource="#ClassA" /> 
</rdfs:Class> 

b) External instance in partition in DAML+OIL 

<daml:Class rdf:ID="ClassA" /> 
<daml:Class rdf:ID="ClassP1" /> 
<daml:Class rdf:ID="ClassP2" /> 
<ClassA rdf:ID="Instance_A" /> 
<daml:Class rdf:about="#ClassA"> 
  <daml:disjointUnionOf rdf:parseType="daml:collection"> 
    <daml:Class rdf:about="#ClassP1" /> 
    <daml:Class rdf:about="#ClassP2" /> 
  </daml:disjointUnionOf> 
</daml:Class> 

Fig. 2. Examples of RDF(S) and DAML+OIL ontologies 
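The circularity in example a) can be found mechanically by following subClassOf links from each class and checking whether the class reaches itself. The following Python sketch is an illustration only, not the procedure used by any of the platforms studied. It assumes that the taxonomy is given as a mapping from each class to its direct superclasses and that the distance of a loop is the number of intermediate classes, so that a class declared as a subclass of itself is at distance zero; the function name `circularities` is hypothetical.

```python
def circularities(subclass_of):
    """Return {class: distance} for every class that is reachable from itself.

    subclass_of maps a class to the set of its direct superclasses.
    Distance is taken as the number of intermediate classes on the shortest loop.
    """
    found = {}
    for start in subclass_of:
        # Breadth-first search along subClassOf links, looking for start.
        frontier, seen, steps = set(subclass_of[start]), set(), 1
        while frontier:
            if start in frontier:
                found[start] = steps - 1
                break
            seen |= frontier
            frontier = {sup for cls in frontier
                        for sup in subclass_of.get(cls, ())} - seen
            steps += 1
    return found

# The three classes of example a).
loop = {"ClassA": {"ClassB"}, "ClassB": {"ClassC"}, "ClassC": {"ClassA"}}
```

With this convention the three classes of example a) are each reported at distance 2, which matches the label of the figure.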



The main conclusions of the RDF(S) and DAML+OIL ontology import are: 
Circularity problems at any distance are the only problems detected by most of 
the ontology platforms analyzed in this experiment. OntoEdit Free, however, does 
not detect circularities at distance zero; it simply ignores them. 





Table 1. Results of the RDF(S) and DAML+OIL ontology import 
Regarding partition errors, we have only studied DAML+OIL ontologies because 
this type of knowledge cannot be represented in RDF(S). Most of the ontology 
platforms used in this study cannot detect partition errors in DAML+OIL ontologies. 
Only WebODE using the ODEval⁸ service detects some partition errors. 
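Two of the partition errors discussed here can be expressed as simple set operations. The sketch below is a hedged illustration, not the ODEval algorithm: it assumes that each class is given with the set of its direct instances (indirect instances through subclasses are not followed), and the function names are hypothetical.

```python
def external_instances(parent_instances, parts):
    """Instances of the partitioned class that belong to none of its parts."""
    covered = set().union(*parts.values())
    return set(parent_instances) - covered

def common_instances(parts):
    """Instances that belong to more than one of the disjoint parts."""
    seen, shared = set(), set()
    for members in parts.values():
        shared |= seen & set(members)
        seen |= set(members)
    return shared

# Example b) of Figure 2: ClassA is partitioned into ClassP1 and ClassP2,
# but Instance_A is declared only as an instance of ClassA.
parts = {"ClassP1": set(), "ClassP2": set()}
external = external_instances({"Instance_A"}, parts)
```

In the DAML+OIL example of Figure 2, Instance_A is returned as an external instance because it belongs to ClassA but to neither class of the partition.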

Grammatical redundancy problems are not detected by most of the ontology platforms 
used in this work. However, some ontology platforms ignore direct redundancies of 
‘subclass-of’ or ‘instance-of’ relations. As in the previous case, only WebODE using 
the ODEval service detects indirect redundancies of ‘subclass-of’ relations in RDF(S) 
and DAML+OIL ontologies. 
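An indirect redundancy of a ‘subclass-of’ relation arises when a direct link is already implied by a longer chain of links. A minimal Python sketch of such a check follows; it is not the ODEval implementation. The taxonomy is assumed to be a mapping from each class to the set of its direct superclasses, and the class names Dog, Mammal and Animal are hypothetical. Because sets are used, a link stated twice (a direct redundancy) cannot be represented and is not checked.

```python
def ancestors(cls, subclass_of):
    """All superclasses reachable from cls through one or more subclass-of links."""
    result, stack = set(), list(subclass_of.get(cls, ()))
    while stack:
        current = stack.pop()
        if current not in result:
            result.add(current)
            stack.extend(subclass_of.get(current, ()))
    return result

def redundant_links(subclass_of):
    """Direct links (sub, sup) that are already implied by a longer path."""
    redundant = set()
    for sub, supers in subclass_of.items():
        for sup in supers:
            # Is sup reachable through some other direct superclass of sub?
            if any(sup in ancestors(other, subclass_of)
                   for other in supers if other != sup):
                redundant.add((sub, sup))
    return redundant

# Hypothetical taxonomy: Dog is declared under both Mammal and Animal.
taxonomy = {"Dog": {"Mammal", "Animal"}, "Mammal": {"Animal"}, "Animal": set()}
```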

4.2 Detecting Knowledge Representation Mistakes during Ontology Building 

In this section we analyze whether the editors of the ontology platforms detect 
concept taxonomy mistakes. We have built 17 ontologies using such ontology 
platforms, each of which implements one of the problems presented in section 2. 

Figure 3 shows two of the ontologies used in this study: the first represents an 
indirect common instance in a disjoint decomposition and the second represents an 
indirect redundancy of a ‘subclass-of’ relation. 




Fig. 3. Examples of ontologies built in the ontology editors 



The results of analyzing the editors of the ontology platforms are shown in table 2. 
The main conclusions of this study are: 

Circularity problems are the only ones detected by most of the ontology platforms 
used in this study. However, OntoEdit Free detects neither circularity at distance one 
nor at distance ‘n’. Furthermore, OntoEdit Free and WebODE have mechanisms to 
prevent ontologists from inserting circularity at distance zero. 

As for partition errors, WebODE detects only external classes in partitions. OilEd 
and Protégé-2000 detect some partition errors when the ontology is verified, but these 
types of mistakes can be inserted in those ontology platforms. Most partition errors 
are not detected by the platforms or cannot be represented in the platforms. 



⁸ http://minsky.dia.fi.upm.es/odeval 
Regarding grammatical redundancy problems, direct redundancies of ‘subclass-of’ 
relations are detected by Protégé-2000 and WebODE, but are forbidden by OntoEdit 
Free. Protégé-2000 also detects indirect redundancies of ‘subclass-of’ relations. Other 
grammatical problems are not detected or cannot be represented in the platforms. 



4.3 Detecting Knowledge Representation Mistakes during Ontology Export 

To analyze whether the export facilities of the ontology platforms detect concept 
taxonomy mistakes, we have exported to RDF(S) and DAML+OIL the 17 ontologies 
built in the previous experiment. After exporting these ontologies, we have analyzed 7 
RDF(S) files and 17 DAML+OIL files. Since RDF(S) cannot represent partition 
knowledge, this type of knowledge is lost when we export to RDF(S). 

The results of analyzing the RDF(S) and DAML+OIL export facilities of these 
ontology platforms are shown in table 3. The main conclusions of this study are: 
Circularity problems are not detected by the RDF(S) and DAML+OIL export facilities 
of the ontology platforms. Furthermore, some ontology platforms do not allow inserting 
this type of problem; therefore, the exported ontologies do not contain these mistakes. 

With regard to partition errors, no ontology platform detects these mistakes. 
Furthermore, some partition errors cannot be represented in the ontology platforms. 

Grammatical redundancy problems are not detected by the ontology platforms 
used in this study. OntoEdit Free and Protégé-2000 do not allow inserting direct 
redundancies of ‘subclass-of’ relations; therefore, neither RDF(S) nor DAML+OIL 
exported files can contain this type of mistake. Furthermore, some grammatical 
problems cannot be represented in the ontology platforms studied. 



5 Conclusions and Further Work 

In this paper we have shown that only a few taxonomic mistakes in RDF(S) and 
DAML+OIL ontologies are detected by ontology platforms during ontology import. 
We have also shown that most editors of ontology platforms detect only a few 
knowledge representation mistakes in concept taxonomies during ontology building. 
And we have also shown that current ontology platforms are not able to detect such 
mistakes during ontology export to RDF(S) and DAML+OIL. 

Taking into account these results, we consider that it is necessary to check for the 
anomalies that can be introduced during ontology building in ontology platforms. 
Therefore it is important that these platforms help the ontologist build ontologies 
without making knowledge representation mistakes. We also consider that it is 
necessary to evaluate ontologies during the import and export processes. 

We also consider that we need tools for giving support to the evaluation activity 
during the whole life-cycle of ontologies. These tools should not only evaluate 
concept taxonomies, but also other ontology components (relations, axioms, etc.). 
Abstract. Often, qualitative values have an ordering, such as (very-short, short, 
medium-height, tall) or a hierarchical level, such as (The-World, Europe, Spain, 
Madrid), which are used by people to interpret mistakes and approximations 
among these values. Confusing Paris with Madrid yields an error smaller than 
confusing Paris with Australia, or Paris with Abraham Lincoln. And the 
“difference” between very cold and cold is smaller than that between very cold 
and warm. Methods are provided to measure such confusion, and to answer 
approximate queries in an “intuitive” manner. Examples are given. Hierarchies 
are a simpler version of ontologies, albeit very useful. Queries have a blend of 
errors by order and errors by hierarchy level, such as “what is the error in 
confusing very cold with tall?” or “give me all people who are somewhat like 
(John (plays baseball) (travels-by water-vehicle) (lives-in North-America)).” 
Thus, retrieval of approximate objects is possible, as illustrated here. 



1 Introduction 

The types of mistakes and misidentifications that people make give clues to how well 
they know a given subject. Confusing Ramses with Tutankamon is not as bad as 
confusing Ramses with George Washington, or with Greenland. Indeed, teachers 
often interpret these mistakes to assess the extent of the student’s learning. 

The paper formalizes the notion of confusion between elements of a hierarchy. 
Furthermore, this notion is extended to hierarchies where each node is an ordered set. 
These are the main thrusts of the paper. 

Some definitions follow. 

Qualitative variable. A single-valued variable that takes symbolic values. ♦ As 
opposed to numeric, vector or quantitative variables. Its value cannot be a set, 
although such a symbolic value may represent a set. Example: the qualitative variables 
(written in italics) profession, travels-by, owns, weighs; the symbolic values (written 
in normal font) lawyer, air-borne-vehicle, horse, heavy. 

Partition. K is a partition of set S if it is both a covering for S and an exclusive set. 
♦ The members of K are mutually exclusive and collectively exhaust S. Each element 
of S is in exactly one member of K. 
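The definition of a partition can be checked directly with set operations. The following Python sketch is an illustration; the function name `is_partition` is hypothetical, and the example values are vehicle names of the kind used later in the paper.

```python
def is_partition(K, S):
    """True if the sets in K are mutually exclusive and together cover S exactly."""
    S = set(S)
    seen = set()
    for member in K:
        member = set(member)
        if member & seen:        # overlap: K is not an exclusive set
            return False
        seen |= member
    return seen == S             # covering, with nothing outside S

vehicles = {"boat", "ship", "helicopter", "airplane"}
ok = is_partition([{"boat", "ship"}, {"helicopter", "airplane"}], vehicles)
```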

Ordered set. An element set whose values are ordered by a < (“less than”) 
relation. ♦ Example: {short, medium-length, long}. Example: {Antarctica, Australia, 
Brazil, Ecuador, Nicaragua, Mexico, Germany, Ireland, Iceland}, where the relation 
“<” is “South of”. 
1.1 Hierarchy 

For a node n in a tree, relations father_of(n), son_of(n), brother_of, ascendant_of... 
are defined, as expected. ♦ 

A hierarchy H is a tree whose root is a set S, and, if a node has sons, then these 
sons form a partition of their father. ♦ This paper deals with hierarchies whose set S 
is formed by symbolic values. Often, we give names (symbolic values, strings) to the 
different subsets of S. Often, we name the hierarchy H after the set S, and we speak of 
“the hierarchy S”. Example: The hierarchy H1 of means of travel or transportation 
vehicles, whose root is the set S = {animal, foot, bike, motor-bike, 2-seat-car, 
4-seat-car, van, bus, train, boat, ship, helicopter, airplane}, is shown in Figure 1. 



H1: {animal, foot, bike, motor-bike, 2-seat-car, 4-seat-car, van, bus, train, 
     boat, ship, helicopter, airplane} 
    land-vehicle 
        animal 
        self-propelled-vehicle 
            foot 
            bike 
        motor-based 
            motor-bike 
            car 
                2-seat-car 
                4-seat-car 
                van 
            bus 
            train 
    water-vehicle 
        boat 
        ship 
    air-borne-vehicle 
        helicopter 
        airplane 

Fig. 1. A hierarchy H1 of transportation vehicles. Some qualitative values, like 
air-borne-vehicle, represent sets: {helicopter, airplane} in our example 



Hierarchies make it easier to compare qualitative values belonging to the same 
hierarchy (§2), and even to different hierarchies [COM in 4, 9]. 



[Figure: hierarchy H2, whose root measure has the sons temperature, weight, and length] 

Fig. 2. A hierarchy having some ordered sets: (short < medium-length < long), (light < 
medium-weight < heavy), (icy < very cold < cold < chilly < warm < hot) 



A hierarchical variable is a qualitative variable whose values are nodes of a 
hierarchy. ♦ The data type of a hierarchical variable is hierarchy. 
Example: travels-by, whose values are nodes of H1 (figure 1). Example: weighs, 
whose values are nodes weight, light, medium-weight and heavy of H2. Note: 
hierarchical variables are single-valued. Thus, a value for travels-by can be 
water-vehicle, but not {boat, ship} although water-vehicle represents {boat, ship}. 

It is also possible for a hierarchy to have some nodes that are ordered nodes. 
Example: Hierarchy of figure 2. 



1.2 Previous Related Work 

Hierarchies are used in data warehousing and data mining; see, for instance, the 
H-sets of [1]. The paper [7] enlarges these notions with greater mathematical 
background. [6] studies hierarchies where the relative proportion of each set in its father 
set is known. On the other hand, [9] deals mainly with ontologies, more elaborate data 
structures used for knowledge representation, of which CYC [2] was an early attempt 
to build an ontology for common concepts. A companion paper in this book [4], 
matches similar concepts in different ontologies. The thesis [8] describes how to map 
concepts from one ontology to another. A practical use of hierarchies is Clasitex [3], 
which finds the themes of an article written in Spanish or English. It uses the concept 
tree, and a word (not in the tree) suggests the topic of one or more concepts in the 
tree. BiblioDigital [5], a recent development, uses a large taxonomy (although not a 
hierarchy) to classify text documents. 

Work described here is similar to Pattern Classifiers, but these classify objects 
according to the values of their properties, whereas hierarchies help to classify these 
values, when they are non-numeric. 




[Figure: hierarchy H3 of living creatures, with nodes including rose, cow, cat, 
lemon (lem), and grapefruit (gf)] 

Fig. 3. A hierarchy of living creatures (lv). an stands for animal; mam for mammal; lem for 
lemon, and gf for grapefruit. See table 1 below 



2 Confusion in Hierarchies 

Who was the first Emperor of Mexico? “Agustin de Iturbide ” is the correct answer; 
“Maximilian of Hapsburg” is a close miss, “Benito Juarez” a fair error, and “Mexico 
City” a gross error. What is closer to a cat, a dog or an orange? Can we measure these 
errors or similarities? Yes, with hierarchies of symbolic values. 
2.1 Confusion in Using r Instead of s, for a Hierarchy H 

If r, s ∈ H, then the confusion in using r instead of s, written conf(r, s), is: 

• conf(r, r) = conf(r, s) = 0, when s is any ascendant of r. 

• conf(r, s) = 1 + conf(r, father_of(s)), otherwise. ♦ 

To measure conf, move from r to s in the hierarchy and count the descending links 
from r to s, the replaced value. conf is not a distance, nor an ultradistance. 
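The recursive definition translates almost literally into code. The Python sketch below is an illustration under a stated assumption: the tree of living creatures used here (root live_being; animal with mammal, cow and cat below it; plant with rose, citric, lemon and grapefruit below it) is an assumed reading of the hierarchy of Figure 3, and the names `FATHER`, `ascendants` and `conf` are chosen for this example.

```python
# father_of for an assumed tree of living creatures (root: live_being).
FATHER = {
    "animal": "live_being", "plant": "live_being",
    "mammal": "animal", "cow": "mammal", "cat": "mammal",
    "rose": "plant", "citric": "plant",
    "lemon": "citric", "grapefruit": "citric",
}

def ascendants(node):
    """All nodes on the path from node's father up to the root."""
    result = []
    while node in FATHER:
        node = FATHER[node]
        result.append(node)
    return result

def conf(r, s):
    """Confusion in using r instead of s: descending links on the way from r to s."""
    if r == s or s in ascendants(r):
        return 0
    return 1 + conf(r, FATHER[s])
```

The recursion climbs from s towards the root until it meets r or an ascendant of r, so the result counts only the descending part of the journey from r to s, and conf(r, s) generally differs from conf(s, r).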

Table 1. conf(r, s). Confusion in using r instead of s, for hierarchy H3. r runs down, while s runs 
to the right. Thus, the black 2 is the confusion of using an animal (an) instead of a cow, while 
the confusion of using a cow instead of an animal is 0. Values (nodes) of H3 are ordered 
width-first in the table 
Example: conf(r, s) in the hierarchy of Figure 3 is given in Table 1. 
conf resembles our sense of “closeness” between these concepts. Examples: 

conf(citric, plant) = 0; if I use citric instead of plant, the confusion is 0, since 
citrics are plants. 

conf(plant, citric) = 1; giving a plant when I wanted a citric is a “small” error; 
giving a cow when I wanted a citric is a larger error (value 2). Using these gradations 
in errors, the paper will later produce responses to queries that are “very similar to x”, 
or “somewhat similar to x”, where x is a node or a predicate. 

The confusion between two brothers, such as cow and cat, is 1. The confusion in 
using a son instead of its father is 0; the confusion in using a father instead of its son 
is 1. conf is not a symmetric function. In the next section we modify the confusion 
between two brothers to be a number < 1, for brothers that belong to an ordered set. 

Points to ponder. The confusion in using a live being instead of a plant is 1. Thus, 
conf (animal, plant) = conf (mammal, plant) = conf (cow, plant) = 1 . This may seem 
odd, but it is not: cow, mammal, and animal are examples of live beings, and the 
confusion of using a live being instead of a plant is 1 . Another example will perhaps 
be more convincing: Say that “wine” and “beer” are brothers, so that conf (wine, 
beer) = 1: if I am given wine when I wanted beer, the confusion is 1. But this is 
exactly the same confusion if I am given red wine instead of beer, or Riesling wine 
instead of beer, or chilled dry Riesling wine vintage 1999 instead of beer. It is always 
1, no matter how “specialized” the wine or the live being is. 

In the other direction, conf (citric, plant) = 0: if I am given a citric when I want a 
plant, the confusion is 0, because a citric is a plant. Another example: If I am given a 
cold beer when I want a beer, the confusion is 0. Similarly, conf (Corona_beer, beer) 
= conf (chilled_Corona_beer, beer) = 0, since all these “specialized” types of beer are, 
nevertheless, beer. 

Thus, conf (r, s), takes into account the relative position of nodes r and s in the 
hierarchy, but only when going down in our journey from r to s. When going up, no 
matter how far apart s is from r, conf is 0 “in the upwards part of the journey from r 
to s.” 



2.2 Confusion in Using r instead of s, for a Hierarchy with Some Ordered Sets 

In §2.1, the confusion between any two brother nodes is 1. For ordered sets, the confusion 
between any two brothers depends on how far apart they are in their ordering. If the 
ordered set has only one element e, then conf (e, e) = 0. If it has two elements, then 
conf (e1, e2) = 1. For ordered sets with n > 2 elements, the confusion 
between two contiguous elements is 1/(n-1). Figure 4 shows an example. 



[Figure: the ordered set icy, cold, tepid, warm, hot, burning; the confusion between 
each pair of contiguous elements is 0.2] 



Fig. 4. A set showing the confusion between its elements 

Thus, conf (icy, cold) = conf (cold, icy) = 0.2; conf (cold, warm) = 0.4 

For a hierarchy composed of sets some of which have an ordering relation (such as 
H^), the confusion in using r instead of s, conf (r, s), is defined as follows: 

• conf (r, r) = conf (r, s) = 0, when s is any ascendant of r. 

• If r and s are distinct brothers, 

conf (r, s) = 1 if the father is not an ordered set; else, 

conf (r, s) = the relative distance from r to s = the number of steps needed to jump 
from r to s in the ordering, divided by (the cardinality of the father - 1). 

• Otherwise, conf (r, s) = 1 + conf (r, father_of(s)). ♦ 

This is like conf for hierarchies formed by (unordered) sets (§2.1; more at [6, 7]), 
except that there the error between two brothers is 1, and here it may be a number 
between 0 and 1. Example (for H^): conf (short, measure) = 0; conf (short, length) = 0; 
conf (short, light) = 2; conf (short, medium-length) = 0.5; conf (short, long) = 1. 
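
The three-part definition above can be sketched in Python. This is a minimal illustration, not the authors' implementation: the dictionaries FATHER and ORDERED, and the placement of the temperature set as a separate root, are assumptions made only to reproduce the values quoted in the text.

```python
# Hypothetical hierarchy fragment (child -> father), chosen to match the
# examples in the text; the full hierarchy of the paper is not reproduced.
FATHER = {
    "length": "measure", "weight": "measure",
    "short": "length", "medium-length": "length", "long": "length",
    "light": "weight", "medium-weight": "weight", "heavy": "weight",
    "icy": "temperature", "cold": "temperature", "tepid": "temperature",
    "warm": "temperature", "hot": "temperature", "burning": "temperature",
}

# Fathers whose sons form an ordered set, with the sons listed in order.
ORDERED = {
    "length": ["short", "medium-length", "long"],
    "weight": ["light", "medium-weight", "heavy"],
    "temperature": ["icy", "cold", "tepid", "warm", "hot", "burning"],
}


def ascendants(node):
    """Return all proper ascendants of node, nearest first."""
    result = []
    while node in FATHER:
        node = FATHER[node]
        result.append(node)
    return result


def conf(r, s):
    """Confusion in using r instead of s (r and s in the same hierarchy)."""
    # First bullet: no confusion when s is r itself or an ascendant of r.
    if r == s or s in ascendants(r):
        return 0
    father_r, father_s = FATHER.get(r), FATHER.get(s)
    # Second bullet: distinct brothers.
    if father_r is not None and father_r == father_s:
        order = ORDERED.get(father_r)
        if order is None:
            return 1
        steps = abs(order.index(r) - order.index(s))
        return steps / (len(order) - 1)
    # Third bullet: one descending link, then continue from the father of s.
    return 1 + conf(r, father_s)


print(conf("short", "light"), conf("short", "medium-length"), conf("cold", "warm"))
```

The recursion always terminates inside a single hierarchy, because climbing from s eventually reaches an ascendant of r (at worst the root).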

3 Queries and Graduated Errors 

This section explains how to pose and answer queries where there is a permissible 
error due to confusion between values of hierarchical variables. 



3.1 The Set of Values That Are Equal to Another, up to a Given Confusion 

A value u is equal to value v, within a given confusion ε, written u =ε v, iff conf(u, 
v) ≤ ε. ♦ It means that value u can be used instead of v, within error ε. 

Example: If v = lemon (Figure 2), then 

the set of values equal to v with confusion 0 is {lemon}; 

the set of values equal to v with confusion 1 is {citric lemon grapefruit}; 

the set of values equal to v with confusion 2 is {plant citric rose lemon grapefruit}. 

Notice that =ε is neither symmetric nor transitive. 

These values can be obtained from Table 1 by looking at column v (“lemon”) and 
collecting as u’s those rows that have conf ≤ ε. 

That two values u and v have confusion 0 does not mean that they are identical (u 
= v). For example, the set of values equal to mammal with confusion 0 is {cow 
mammal cat}, and the set of values equal to live being (the root) with confusion 0 
contains all nodes of the hierarchy, since every node is a live being. 
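
As a rough sketch of this definition, the set of values equal to v within confusion ε can be collected by scanning all nodes. The hierarchy fragment below (FATHER, NODES) is an assumption assembled from the nodes named in the examples, and conf here handles unordered sets only, by counting descending links.

```python
# Hypothetical fragment of the live-being hierarchy (child -> father).
FATHER = {
    "plant": "live being", "animal": "live being",
    "citric": "plant", "rose": "plant",
    "lemon": "citric", "grapefruit": "citric",
    "mammal": "animal", "cow": "mammal", "cat": "mammal",
}
NODES = ["live being"] + list(FATHER)


def conf(r, s):
    """Descending links on the path from r to s (unordered sets only)."""
    # r together with all its ascendants: reaching any of them costs nothing.
    up = [r]
    while up[-1] in FATHER:
        up.append(FATHER[up[-1]])
    # Climb from s until we meet that chain; each climb is one descending link.
    steps = 0
    while s not in up:
        s = FATHER[s]
        steps += 1
    return steps


def equal_within(v, eps):
    """All values u with u =eps v, that is, conf(u, v) <= eps."""
    return {u for u in NODES if conf(u, v) <= eps}


print(sorted(equal_within("lemon", 1)))
```

Because conf is not symmetric, equal_within("lemon", 1) contains citric while equal_within("citric", 0) contains lemon; the two sets answer different questions.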



3.2 Identical, Very Similar, Somewhat Similar Objects 

Objects are entities described by a set of (property, value) pairs, which in our notation 
we refer to as (variable, value) pairs. They are also called (relationship, attribute) 
pairs in databases. An object o with k (variable, value) pairs is written as 
(o (v_1 a_1) (v_2 a_2) ... (v_k a_k)). 

We want to estimate the error in using object o’ instead of object o. For an object o 
with k (perhaps hierarchical) variables v_1, v_2, ..., v_k and values a_1, a_2, ..., a_k, we say about 
another object o’ with the same variables v_1 ... v_k but with values a_1’, a_2’, ..., a_k’, the 
following statements: 

o’ is identical to o if a_i’ = a_i for all 1 ≤ i ≤ k. All corresponding values are identical. 
♦ If all we know about o and o’ are their values on variables v_1, ..., v_k, and both 
objects have these values pairwise identical, then we can say that “for all we 
know,” o and o’ are the same. 

o’ is a substitute for o if conf (a_i’, a_i) = 0 for all 1 ≤ i ≤ k. ♦ All values of o’ have 
confusion 0 with the corresponding value of o. There is no confusion between 
a value of an attribute of o’ and the corresponding value for o. 
o’ is very similar to o if Σ conf (a_i’, a_i) = 1. ♦ The sum of all confusions is 1. 
o’ is similar to o if Σ conf (a_i’, a_i) = 2. ♦ 
o’ is somewhat similar to o if Σ conf (a_i’, a_i) = 3. ♦ 

In general, o’ is similar_n to o if Σ conf (a_i’, a_i) = n. ♦ 

These relations are not symmetric. 
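
A small sketch of these definitions follows. It assumes objects are stored as Python dictionaries and, to stay short, looks confusions up in a table (CONF) instead of computing them from a hierarchy; the three table entries are the ones quoted in the footnote to Example 1, and everything else is illustrative.

```python
# Confusion values taken from the footnote to Example 1; any pair not listed
# here is simply unknown to this sketch.
CONF = {
    ("water-vehicle", "boat"): 1,
    ("plant", "bird"): 2,
    ("medium-weight", "heavy"): 0.5,
}


def conf(r, s):
    """Confusion in using r instead of s, by table lookup."""
    return 0 if r == s else CONF[(r, s)]


def object_confusion(o_prime, o):
    """Sum of conf(a_i', a_i) over the variables of o (o_prime replaces o)."""
    return sum(conf(o_prime[var], o[var]) for var in o)


def relation(o_prime, o):
    """Name of the relation obtained when using o_prime instead of o."""
    if o_prime == o:
        return "identical"
    total = object_confusion(o_prime, o)
    names = {0: "substitute", 1: "very similar", 2: "similar", 3: "somewhat similar"}
    return names.get(total, f"similar_{total}")


ed = {"travels-by": "water-vehicle", "owns": "plant", "weighs": "medium-weight"}
bob = {"travels-by": "boat", "owns": "bird", "weighs": "heavy"}
print(relation(ed, bob))
```

Swapping the arguments would require conf (boat, water-vehicle) and the other reversed pairs, which generally have different values; this is why the relations are not symmetric.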

Table 2. Relations between objects of Example 1. This table gives the relation obtained when 
using object o’ (running down the table) instead of object o (running across the table) 

        Ann            Bob            Ed                 John 
Ann     identical      similar_n      somewhat similar   similar_n 
Bob     very similar   identical      very similar       similar_n 
Ed      similar        similar_3.5    identical          similar_n 
John    substitute     similar_n      similar_n          identical 


Example 1 (We use hierarchies Hj, and H^). Consider the objects 
(Ann (travels-by land-vehicle) (owns animal) (weighs weight)) 

(Bob (travels-by boat) (owns bird) (weighs heavy)) 

(Ed (travels-by water-vehicle) (owns plant) (weighs medium-weight)) 

(John (travels-by car) (owns cow) (weighs light)). 

Then Ann is similar_n to Bob; Bob is very similar to Ann; Ann is somewhat similar 
to Ed; Ed is similar_3.5 to Bob;' Bob is similar_n to John, etc. See Table 2. 

Hierarchical variables allow us to define objects with different degrees of 
precision. This is useful in many cases; for instance, when information about a given 
suspect is coarse, or when the measuring device lacks precision. Queries with “loose 
fit” permit handling or matching objects with controlled accuracy, as explained below. 



3.3 Queries with Controlled Confusion 

A table of a database stores objects like Ann, Bob... defined by (variable, value) 
pairs, one object per row of the table. We now extend the notion of queries to tables 
with hierarchical variables,^ by defining the objects that have property P within a 
given confusion ε, where ε ≥ 0. 

P holds for object o with confusion ε, written P_ε holds for o, iff 

• If P_ε is formed by non-hierarchical variables, iff P is true for o. 

• For pr a hierarchical variable and P_ε of the form (pr c),^ iff for value v of 
property pr in object o, v =ε c. [if the value v can be used instead of c with 
confusion ε] 

• If P_ε is of the form P1 ∨ P2, iff P1_ε holds for o or P2_ε holds for o. 

• If P_ε is of the form P1 ∧ P2, iff P1_ε holds for o and P2_ε holds for o. 

• If P_ε is of the form ¬P1, iff P1_ε does not hold for o. ♦ 

The definition of P_ε holds for o allows control of the “looseness” of P or of some 
parts of P; for instance, the predicate (plays guitar)_0 will match people who play guitar 



' conf (water-vehicle, boat) = 1; conf (plant, bird) = 2; conf (medium-weight, heavy) = 0.5; 
they add to 3.5. 

^ For non-hierarchical variables, a match in value means conf = 0; a mismatch means conf = ∞. 
^ (pr c) in our notation means: variable pr has the value c. Example: (profession Engineer). It 
is a predicate that, when applied to object o, returns T or F. 

or any of the variations (sons) of guitar (refer to Figure 5); (plays guitar)_1 will match 
those people just mentioned as well as people who play violin and harp. 



What do we mean by “P holds for o” when we do not specify the confusion of P? If 
P and o are not formed using hierarchical variables, the meaning is the usual meaning 
given in Logic. Nevertheless, if P or o use hierarchical variables, then by “P holds for 
o” we mean “P_0 holds for o”. This agrees with our intuition: the predicate (owns 
chord-instrument), given without explicitly telling us its allowed confusion, is interpreted as 
(owns chord-instrument)_0, which will also match a person owning an 
electric-guitar, say. 



musical-instrument 
    chord-instrument: violin, guitar, harp 
        guitar: electric-guitar, Spanish-guitar 
    wind-instrument: flute, clarinet, saxophone 
    keyboard-instrument: piano, harpsichord 

Fig. 5. A hierarchy of musical instruments 



Example 2 (refer to hierarchies and persons of Example 1). Let the predicates 
P = (travels-by bike) ∨ (owns cow), 

Q = (travels-by helicopter) ∧ (owns cat), 

R = ¬ (travels-by water-vehicle). 

Then we have that P_0 holds for John; P_1 holds for John; P_2 holds for {Ann, Bob, 
John}; P_3 holds for {Ann, Bob, Ed, John}, as well as P_4, P_5, ... 

We also have that Q_0 holds for nobody; Q_1 holds for nobody; Q_2 holds for {Ann, Bob, 
John}; Q_3 holds for {Ann, Bob, Ed, John}, as well as Q_4, Q_5, ... 

We also have that R_0 holds for {Ann, John}; R_1 holds for nobody, as well as R_2, R_3, 
R_4, ... 

From the definition of P_ε holds for o, it is true that (P ∨ Q)_ε = (P_ε ∨ Q_ε). This 
means that for (P ∨ Q)_a = (P_b ∨ Q_c), a = min (b, c). Similarly, for (P ∧ Q)_a = (P_b ∧ Q_c), 
we have a = max (b, c). 
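
The recursive definition of “P_ε holds for o” can be sketched as a small evaluator. The encoding of predicates as nested tuples, the helper names, and the three sample persons are assumptions of this sketch; the hierarchy is a fragment of Figure 5, and conf covers unordered sets only.

```python
# Fragment of the musical-instrument hierarchy of Figure 5 (child -> father).
FATHER = {
    "chord-instrument": "musical-instrument",
    "wind-instrument": "musical-instrument",
    "keyboard-instrument": "musical-instrument",
    "violin": "chord-instrument", "guitar": "chord-instrument",
    "harp": "chord-instrument", "flute": "wind-instrument",
    "piano": "keyboard-instrument",
    "electric-guitar": "guitar", "Spanish-guitar": "guitar",
}


def conf(r, s):
    """Descending links on the path from r to s (unordered sets only)."""
    up = [r]
    while up[-1] in FATHER:
        up.append(FATHER[up[-1]])
    steps = 0
    while s not in up:
        s = FATHER[s]
        steps += 1
    return steps


def holds(pred, obj, eps):
    """True iff predicate pred holds for object obj with confusion eps.

    Hypothetical encoding: ("atom", variable, value), ("or", P1, P2),
    ("and", P1, P2), ("not", P1).
    """
    kind = pred[0]
    if kind == "atom":
        _, var, value = pred
        return conf(obj[var], value) <= eps
    if kind == "or":
        return holds(pred[1], obj, eps) or holds(pred[2], obj, eps)
    if kind == "and":
        return holds(pred[1], obj, eps) and holds(pred[2], obj, eps)
    if kind == "not":
        return not holds(pred[1], obj, eps)
    raise ValueError(f"unknown predicate kind: {kind}")


# Hypothetical persons, used only to exercise the evaluator.
ana = {"plays": "electric-guitar"}
luis = {"plays": "violin"}
eva = {"plays": "flute"}
plays_guitar = ("atom", "plays", "guitar")
print([holds(plays_guitar, p, 1) for p in (ana, luis, eva)])
```

Note that a negated predicate matches fewer objects as ε grows, which is the behaviour of R in Example 2.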



Accumulated confusion. For compound predicates, a tighter control of the error or 
confusion is possible if we require that the accumulated error does not exceed a 
threshold ε. This is accomplished by the following definition. 

P holds for object o with accumulated confusion ε, written P^ε holds for o, iff 

• If P^ε is formed by non-hierarchical variables, iff P is true for o. 

• For pr a hierarchical variable and P^ε of the form (pr c), iff for value v of property 
pr in object o, v =ε c. [if the value v can be used instead of c with confusion ε] 

• If P^ε is of the form P1 ∨ P2, iff P1^ε holds for o or P2^ε holds for o. 

• If P^ε is of the form P1 ∧ P2, iff there exist confusions a and b such that a + b = ε 
and P1^a holds for o and P2^b holds for o. 

• If P^ε is of the form ¬P1, iff P1^ε does not hold for o. ♦ 

Example 3: For Q = (travels-by helicopter) ∧ (owns cat), we see that Q^0 holds for 
nobody; Q^1 holds for nobody; Q^2 holds for nobody; Q^3 holds for John; Q^4 holds for 
{Ann, Bob, John}; Q^5 holds for {Ann, Bob, Ed, John}, as well as Q^6, Q^7, ... 



Closeness. An important number that measures how well object o fits predicate P_ε is 
the smallest ε for which P_ε(o) is true. This leads to the following definition. 

The closeness of an object o to a predicate P_ε is the smallest ε which makes P_ε true. 
♦ The smaller this ε is, the “tighter” P_ε holds. 

Example: (refer to hierarchies, persons and predicates of Example 2) The closeness 
of P_ε to John is 0; its closeness to Ann is 2; to Bob is 2, and to Ed is 3. This means 
that John fits P_ε better than Ed. See Table 3. 
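
For predicates built only from atoms, ∨ and ∧, closeness can be computed directly instead of searching over ε: an atom contributes its confusion, a disjunction takes the minimum, and a conjunction takes the maximum (or the sum, under accumulated confusion). The sketch below assumes a tuple encoding of predicates and a lookup table CONF; the value for (car, helicopter) is inferred from the John entries of Examples 2 and 3, and the value for (car, bike) is purely hypothetical. Negation is deliberately left out.

```python
# Assumed confusion values; see the lead-in for where they come from.
CONF = {
    ("car", "helicopter"): 2,   # inferred from Examples 2 and 3 (assumption)
    ("car", "bike"): 1,         # hypothetical value
    ("cow", "cat"): 1,          # brothers, as stated in the text
}


def conf(r, s):
    """Confusion in using r instead of s, by table lookup."""
    return 0 if r == s else CONF[(r, s)]


def closeness(pred, obj, accumulated=False):
    """Smallest confusion for which pred holds for obj (no negation)."""
    kind = pred[0]
    if kind == "atom":
        _, var, value = pred
        return conf(obj[var], value)
    left = closeness(pred[1], obj, accumulated)
    right = closeness(pred[2], obj, accumulated)
    if kind == "or":
        return min(left, right)
    if kind == "and":
        return left + right if accumulated else max(left, right)
    raise ValueError(f"unsupported predicate kind: {kind}")


john = {"travels-by": "car", "owns": "cow"}
P = ("or", ("atom", "travels-by", "bike"), ("atom", "owns", "cow"))
Q = ("and", ("atom", "travels-by", "helicopter"), ("atom", "owns", "cat"))
print(closeness(P, john), closeness(Q, john), closeness(Q, john, accumulated=True))
```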



Table 3. Closeness of an object to a predicate. Persons, hierarchies and predicates are those of 
example 2 





P_ε 


Q_ε 


R_ε 


Ann 
0 


Bob 
2 


∞ 


Ed 
2 


∞ 


John 


0 


3 


0 



4 Conclusions 

The paper shows a way to introduce ordered sets into hierarchies. 

Hierarchies can be applied to a variety of jobs: 

To compare two values, such as Madrid and Mexico City, and to measure their 
confusion (§2), for instance in answering the query “What is the capital of 
Spain?” 

To compare two objects for similarity, like Ann and Ed (§3.2), giving rise to the 
notions of identical, very similar, similar... objects (not values). 

To find out how closely an object o fits a predicate P_ε (definition of closeness, 
§3.3). 

To retrieve objects that fit imperfectly a given predicate to a given threshold, 
using P_ε holds for o (confusion, §3.3 and Example 2), and P^ε holds for o 
(accumulated confusion, §3.3 and Example 3). 

To handle partial knowledge. Even if we only know that Ed travels-by 
water-vehicle, we can productively use this value in controlled searches 
(Example 1 of §3.2). 

Hierarchies make a good approximation to the manner in which people use 
gradation of qualitative values (ordered sets), to provide less than crisp, but useful, 
answers. 

Ordered sets add a further refinement to the precision with which confusion can be 
measured and used. 

Hierarchies can also be used as an alternative to fuzzy sets, defining a membership 
function for a set with the help of closeness. 

They can also be employed as a supervised pattern classifier, by using the definitions 
of §3.2 that measure how close two objects are, and by using the definitions of P_ε and 
P^ε (§3.3). 

In [7] we describe a mathematical apparatus and further properties of functions and 
relations for hierarchies. Instead, [4, 9] explain similar functions, relations and exam- 
ples for ontologies. 

Abstract. A concise manner to send information from agent A to B is to use 
phrases constructed with the concepts of A: to use the concepts as the atomic 
tokens to be transmitted. Unfortunately, tokens from A are not understood by 
(they do not map into) the ontology of B, since in general each ontology has its 
own address space. Instead, A and B need to use a common communication 
language, such as English: the transmission tokens are English words. 

An algorithm is presented that finds the concept c_B in O_B (the ontology of B) 
most closely resembling a given concept c_A. That is, given a concept from 
ontology O_A, a method is provided to find the most similar concept in O_B, as 
well as the similarity sim between both concepts. Examples are given. 



1 Introduction and Objectives 

How can we communicate our concepts, what we really mean? Two persons (or 
agents) A and B can communicate through previously agreed stereotypes, such as the 
calling sequence between a caller program and a called subroutine. This requires 
previous agreement between A and B. This paper deals with communication with 
little previous consensus: A and B agree only to share a given communication 
language. The purpose of the communication is for A and for B to fulfill their 
objectives or goals. That is, we shall call a communication successful if A and B 
are closer to their goals as the result of such communication. 

What can an agent do to meaningfully communicate with other agents (or persons), 
even when they have not made any very specific commitment to share a private ontology 
and communication protocol? Concept communication cannot be fulfilled through 
direct exchange of concepts belonging to an ontology, since they do not share the 
same ontology. Instead, communication should be sought through a common 
language. Lucky agents can agree on a language whose words have a unique meaning. 
Others need to use an ambiguous language (such as a natural language) to share 
knowledge. This gives rise to imperfect understanding and confusion. This is the thrust 
of this paper. 

The objective of this work is to find the most similar (in meaning) object in B’s 
ontology corresponding to a given object in A’s ontology, and to measure their 
similarity. Example: Assume A wants to transmit its concept grapefruit* to B. 



* In this paper, concepts appear in Courier font. 

R. Monroy et al. (Eds.): MICAI 2004, LNAI 2972, pp. 129-138, 2004. 
© Springer-Verlag Berlin Heidelberg 2004 
To this end, A translates it into the word grapefruit, which is then transmitted to B. But B 
has no such word in its ontology. Thus, B asks A “what is a grapefruit?” A answers 
“it is a citric” (by seeing that citric is the father of grapefruit in O_A). 
Unfortunately, B has no concept to map the word “citric”. So B asks A “what is a citric?” 
A answers “it is a fruit”. Now, O_B has concept fruit denoted by word fruit. But 
fruit_B (the subindex B means “in O_B”) has several children: B knows several fruits. 
Now B has to determine which of the children of fruit_B most resembles 
grapefruit_A. It may do so by seeing which child of fruit_B has children quite 
similar to the children of grapefruit_A. Or by seeing which fruits in O_B have 
skin, bone, weight... similar to those of grapefruit_A. Unfortunately, the problem is 
recursive: what is skin for B is epidermis for A, and peel for C. weight_A is in 
kilograms, whereas weight_B is in pounds. So the comparison has to continue 
recursively. §2 gives a precise description of the algorithm. 



living_thing (creature, organism, live being) 
    plant_living (plant, vegetal) 
        eatable_plant (...) 
            fruit (...) 
            vegetable (...) 
        ornate_plant (ornate plant) 
    animal_living (animal) 
        farm_animal (...) 
            chicken (chicken, hen, cock) 
        savage_animal (wild animal, beast) 
            zebra (zebra) 
            lion (lion) 

Fig. 1. An ontology consists of a tree of concepts (nodes) under the subset relation (solid 
arrows), with other relations such as eats (dotted arrows), and with words associated to each 
concept (in parenthesis after each concept; some are omitted). Nodes also have (property, 
value) pairs, not shown in the figure 



1.1 Ontologies 

Knowledge is the concrete internalization of facts, attributes and relations among 
real-world entities. ♦ It is stored as concepts; it is measured in “number of concepts.” 
Concept. An object, relation, property, action, idea, entity or thing that is well known 
to many people, so that it has a name: a word(s) in a natural language. ♦ 
Examples: cat-chien, to_fly_in_air, angry_mad. So, concepts have 
names: those words used to denote them. A concept is unambiguous, by definition. 
Unfortunately, the names given by different people to a concept differ and, more 
unfortunately, the same word is given to two concepts (examples: words mole; 
star; can). Thus, words are ambiguous, ^ while concepts are not. A person or agent, 
when receiving words from some speaker, has to solve their ambiguity in order to 


2 Some symbols or words are unambiguous: 3, Abraham Lincoln, π, (30°N, 15°W). 
understand the speaker, by mapping the words to the “right” concept in his/her/its 
own ontology. The mapping of words to concepts is called disambiguation. 

If two agents do not share a concept, at least partially, they cannot 
communicate it or about it. A concept has (property, value) pairs associated with it. 

Ontology. It is a formal explicit specification of a shared conceptualization [5].4 It is 
a hierarchy or taxonomy of the concepts we know.^ We represent an ontology as a 
tree, where each node is a concept with directed arcs (representing the relation 
subset and, at the leaves, perhaps the relation member_of instead of subset) 
to other concepts. Other relations (such as part_of, eats-ingests, 
lives_in, ...) can be drawn, with arcs of different types (figure 1). In general, 
these relations are also nodes in another part of the ontology. 

Associated words. To each concept (node) there are several English words'* 
associated: those that denote it or have such concept as their meaning. Example: 
concept mad_angry has associated (is denoted by) words angry, crossed, pissed-off, 
mad, irritated, incensed. Example: Word mole denotes a small_rodent, a 
spy_infiltrator and also a blemish_in_skin. 



1.2 Related Work 

[12] represents concepts in a simpler format, called a hierarchy. Most works (for 
instance [11]) on ontologies involve the construction of a single ontology, even those 
that do collaborative design [8]. Often, ontologies are built for man-machine inter- 
action [10] and not for machine-machine interaction. [1] tries to identify conceptually 
similar documents, but uses a single ontology. [3, 4] do the same using a topic 
hierarchy: a kind of ontology. [9] seeks to let several agents that share a 
single ontology communicate. The authors have been motivated [6, 7] by the need of agents to 
communicate with unknown agents, so that not much a priori agreement between 
them is possible. With respect to concept comparison, an ancestor of our COM 
matching mechanism (§2; it first appears in [13]) is [2], based on the theory of analogy. 



2 Most Similar Concepts in Two Different Ontologies 

The most similar concept c_B in O_B to concept c_A in O_A is found by the COM algorithm 
using the function sim(c_A) (called “hallar(c_A)” in [13]) as described in the four cases 
below. It considers a concept, its parents and sons. In this section, for each case, a 
tree structure shows the situation and a snapshot of a screen presents an example. 
Assume that agent A emits (sends) to B words^ corresponding to c_A, and also sends 
words corresponding to the father of c_A, denoted by p_A. COM finds c_B = sim(c_A), the 
concept in O_B most similar to c_A. sim also returns a similarity value sv, a number 
between 0 and 1 denoting how similar the returned concept c_B is to c_A. 



^ Each concept that I know and has a name is shared, since it was named by somebody else. 
“* Or word phrases, such as “domestic animal”. 

^ Remember, an agent can not send a node to another agent, just words denoting it. 

Fig. 2. Case (a). Words from c_A and 
p_A match words from c_B and p_B 




Fig. 3. Case (b). Words from p_A 
match words from p_B but c_A has no 
equivalence 



Case a) We look in O_B for two nodes p_B and c_B, such that: (1) the words associated to 
c_B coincide with most of the words (received by B from A)® of c_A; and (2) the 
words associated to p_B coincide with most of the words corresponding to p_A; and 
(3) p_B is the father, grandfather or great-grandfather’ of c_B. 
If such p_B and c_B are found, then c_B is the nearest concept to c_A; the answer is c_B and 
the algorithm finishes returning sv = 1. Figure 2 represents this situation. Figure 4 
shows the screenshot of COM when seeking in B the concept most similar to 
apple_A. The answer is concept apple_B in B with sv = 1. 
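
Case (a) can be sketched as follows. This is one possible reading of the three conditions, not the COM implementation: the representation of O_B as a dictionary, the helper names word_match and case_a, and the way “most of the words” is turned into a fraction above 0.5 are all assumptions.

```python
def word_match(sent_words, own_words, threshold=0.5):
    """True if more than `threshold` of the words sent by A are known here."""
    if not sent_words:
        return False
    return len(sent_words & own_words) / len(sent_words) > threshold


def case_a(c_words, p_words, onto_b, max_levels=3):
    """Look in O_B for c_B and p_B as in case (a).

    onto_b maps a concept name to {"words": set of words, "father": name or None}.
    Returns (c_B, sv) with sv = 1.0, or None when case (a) does not apply.
    """
    for name, node in onto_b.items():
        if not word_match(c_words, node["words"]):
            continue                      # condition (1) fails for this node
        ancestor = node["father"]
        for _ in range(max_levels):       # father, grandfather, great-grandfather
            if ancestor is None:
                break
            if word_match(p_words, onto_b[ancestor]["words"]):
                return name, 1.0          # conditions (2) and (3) hold
            ancestor = onto_b[ancestor]["father"]
    return None


# Hypothetical ontology B, with an ambiguous word "key".
onto_b = {
    "thing": {"words": {"thing"}, "father": None},
    "tool": {"words": {"tool"}, "father": "thing"},
    "data": {"words": {"data"}, "father": "thing"},
    "key_data": {"words": {"key"}, "father": "data"},
    "key_tool": {"words": {"key"}, "father": "tool"},
}
print(case_a({"key"}, {"tool"}, onto_b))
```

Sending the words of the father together with those of the concept is what resolves the ambiguity of “key” in this toy ontology.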



[Screenshot: COM with ontologies A and B loaded; concept APPLE of A is matched to 
concept APPLE of B.] 





Fig. 4. Case (a). Screen with the execution of COM for the case shown in Fig. 2 



Case b) This case occurs when (2) of case (a) holds, but (1) and (3) do not. p_B is 
found in O_B but c_B is not. See Figure 3. In this case, sim (which Olivares calls 
hallar) is called recursively, and we try to compute p_B' = sim(p_A) to confirm that p_B is 
the ancestor of the concept of interest (c_A). 



® We have found useful the threshold 0.5: more than half of the compared entities must 
coincide. 

’ If Pb is found more than three levels up, the “semantic distance” is too high and sim says “no 
match.” 
(1) If the p_B' found is thing, the root of O_B, the algorithm returns not_found and 
concludes; sv = 0; 

(2) Otherwise, a special child of p_B, to be called c_B', is searched in O_B, such that: 

A. Most of the pairs (property, value) of c_B' coincide with the corresponding 
pairs of c_A. Children of p_B with just a few matching properties 
or values are rejected. If the candidate c_B' analyzed has children, they 
are checked (using sim recursively) for a reasonable match with the 
children of c_A. If a c_B' is found with the desired properties, the algorithm 
reports success returning c_B' as the concept in O_B most similar to c_A. sv = 
the fraction of pairs of c_B' coinciding with corresponding pairs of c_A. 

B. Otherwise c_B' is sought among the sons of the father (in B) of p_B; that is, 
among the brothers of p_B; if necessary, among the sons of the sons of p_B; 
that is, among the grandsons of p_B. If found, the answer is c_B'. sv = the 
sv returned by c_B' multiplied by 0.8 if c_B' was found among the sons of 
p_B,* or by 0.8² = 0.64 if found among the grandsons of p_B. 

C. If such c_B' is not found, then the node nearest to c_A is some son of p_B, 
therefore sim returns the remark (son_of p_B) and the algorithm 
concludes. sv = 0.5 (an arbitrary but reasonable value). For example, if 
A sends words that correspond to the pair (c_A = kiwi, p_A = fruit), and B 
has the concept fruit but doesn't have the concept kiwi nor any similar 
fruit, in this case, the concept kiwi (of A) is translated by B into (son_of 
fruit), which means “some fruit I don't know” or “some fruit I do not 
have in my ontology.” 

Figure 5 shows the execution of COM for case (b)2(A). In this case concept kiwi_A 
has no equivalent in B. Here rare_fruit_B is chosen from B as the most similar 
concept because parents coincide and properties of kiwi_A and rare_fruit_B are 
similar (that was calculated using COM recursively for each property-value). sv = 0.8 
because the exact equivalent concept in B was not found. 
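
Step (2)A and the 0.8 decay of step (2)B can be illustrated with a short sketch. It simplifies the text in one important way: property values are compared by plain equality, whereas COM compares them by calling sim recursively. The function names, the decay_levels parameter and the sample property lists are hypothetical.

```python
def pair_fraction(props_a, props_b):
    """Fraction of (property, value) pairs of the candidate c_B' found in c_A.

    Values are compared by equality here (a simplification of the text).
    """
    if not props_b:
        return 0.0
    same = sum(1 for prop, value in props_b.items() if props_a.get(prop) == value)
    return same / len(props_b)


def best_child(props_a, candidates_b, threshold=0.5, decay_levels=0):
    """Pick the candidate whose pairs best coincide with those of c_A.

    candidates_b maps candidate names to their property dictionaries.
    decay_levels is 0 for step (2)A and 1 or 2 when the candidate was found
    further away, as in step (2)B (each level multiplies sv by 0.8).
    Returns (name, sv), or None if no candidate exceeds the threshold.
    """
    best = None
    for name, props in candidates_b.items():
        fraction = pair_fraction(props_a, props)
        if fraction > threshold and (best is None or fraction > best[1]):
            best = (name, fraction)
    if best is None:
        return None
    return best[0], best[1] * 0.8 ** decay_levels


# Hypothetical property lists, loosely following the kiwi example.
kiwi_a = {"color": "brown", "interior": "green", "taste": "sweet"}
children_of_fruit_b = {
    "mango": {"color": "yellow", "shape": "large"},
    "rare_fruit": {"color": "brown", "interior": "green"},
}
print(best_child(kiwi_a, children_of_fruit_b))
```

When best_child returns None, step (2)C applies and the answer would be the remark (son_of p_B) with sv = 0.5.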

Case c) This case occurs when (1) of case (a) holds but (2) and (3) do not. See figure 
6. c_B is found but p_B is not. We try to ascertain whether the grandfather (in O_B) of 
c_B has words that match those of p_A (corresponding words that are equal exceed 
50%), or if the great-grandfather of c_B in O_B has such matching words. 

(1) If that is the case, the concept in O_B most similar to p_A is the grandfather (or 
the great-grandfather) of c_B, and the algorithm finishes returning c_B. sv = 0.8 
for the grandfather case, and 0.8² for the great-grandfather case. 

(2) Otherwise (parents do not match), we verify two conditions: 

A. Most of the properties (and their corresponding values) of c_B should 
coincide (using sim) with those of c_A; and 

B. Most of the children of c_A should coincide (using sim) with most of the 
children of c_B. 

If the properties in (A) and the children in (B) coincide, the algorithm 
concludes with response c_B, although it did not find in O_B the p_B that 
corresponds to the concept p_A in O_A. sv = the fraction of properties and children 
of c_B matching with corresponding entities of c_A. 



We have found that 0.8 allows for a fast decay as one moves up from father to grandfather 
and up. 
(3) If even fewer properties and children are similar, then the response is (probably 
c_B) and the algorithm finishes. sv is computed as in (2)B. 

(4) If neither properties nor children are similar, the response is not_found and the 
algorithm finishes. sv = 0. 



[Screenshot: COM with ontologies A and B loaded; concept KIWI of A is mapped to 
RARE_FRUIT of B with sv = 0.8.] 



Fig. 5. Case (b). Screen with the execution of COM corresponding to Figure 3 




Fig. 6. Case (c). Words from c_A match 
with words of c_B but there is no equivalence 
for words of p_A. See Figure 




Fig. 7. Case (d). There are no words 
from c_A nor p_A that match with words 
of B. 



Figure 8 shows an example of case (c)(2). In this case we use COM to seek in B 
the most similar concept to apple_A. Here concepts match but parents do not 
(fruit_A, food_B) (words are different for each parent), therefore the similarity of the 
properties is used (calling COM recursively). sv = 0.8 because parents do not 
coincide. 

Case d) If neither c_B nor p_B are found, the algorithm concludes returning the response 
not_found. sv = 0. No node similar to c_A could be found in O_B. The agents may have 
different ontologies (they know about different subjects) or they do not share a 
common communication language. See figures 7 and 9. 

Figure 9 shows the execution of case (d). Observe that ontology O_A is mainly about 
fruits while O_B is mainly about Computer Science. There are some concepts in 
common, but not the involved concepts. sv = 0. 

[Screenshot: COM with ontologies A and B loaded; concept APPLE of A is matched to 
concept APPLE of B although their parents differ.] 



Fig 8. Case (c). Screen with the execution of COM corresponding to figure 6 



[Screenshot: COM with ontologies A and B loaded; for concept APPLE of A the result is 
not_found.] 



Fig 9. Case (d). Screen with the execution of COM for case (d). Ontologies refer mostly to 
different areas. COM returns not found with sv = 0 



sim is not symmetric. If c_B is the concept most similar to c_A, it is not necessarily 
true that c_A is the concept most similar to c_B. Example: O_A knows ten kinds of 
hammer_A, while O_B only knows hammer_B (a general hammer). Then, COM maps 
each of the ten hammer_A into hammer_B, while hammer_B best maps into, say, 
hammer_for_carpenter_A [12]. 

The function sim is only defined between a concept c_A in O_A and the most similar 
concept c_B in O_B. 
Ontology A

thing : .

living_creature : .
animal : .

plant_living : color-green_color, produce-oxigen .

melon : .
bean : .

tool : .

screwdriver : .
key_tool : .
data : .

field : .

integer_field : .
real_field : .
double_field : .
float_field : .
key_data : .

foreign_key : .

Primary_key : .

Fig. 10. Ontology A. Used to compute similarity to concepts in ontology B 



2.1 Examples of Similarity 

Now we give examples for sim, the similarity between two concepts, each from one 
ontology. Here we assume that properties like relations and colors are part of both 
ontologies. For simplicity properties are shown only where needed. Properties appear 
after the colon as relation-value pairs. For ontologies A and B (Figures 10 and 11): 
sim(field_A) = field_B with sv = 1, because words of concepts and parents
coincide. This is an example of case (a).

sim(key_tool_A) = key_tool_B with sv = 1. This is an example of case (a), where
words of the parent and concept in A match words of corresponding nodes in B.
Although the word ‘key’ denotes (belongs to the associated words of) both concepts
key_data_B and key_tool_B, the words of tool_A only map into those of tool_B,
and key_tool_B is selected without ambiguity.

sim(screwdriver_A) = (son_of tool_B) with sv = 0.5. This is case (b):
parents coincide, but in ontology B there is no concept similar to screwdriver_A;
therefore the algorithm detects that agent A is referring to a son of concept tool_B.

sim(plant_living_A) = vegetable_B with sv = 0.8. This is an example of case
(b) where parents coincide but the concepts do not. In this case, properties of concepts
are used to establish the similarity among concepts. The similarity of the properties is
calculated using COM recursively for each property and value.

sim(double_field_A) = not_found with sv = 0. This is an example of case (d),
where neither the concept nor its parent is found in B. Ontology A has sent B a concept of
which B has no idea.
Ontology B

thing : . 

living_creature : . 
animal : . 

vegetable : color-green_color,produce-oxigen. 
apple : . 
bean: . 

tool : . 

hammer : . 
key_tool : . 
data : . 

field: . 
key_data : . 

Fig. 11. Ontology B. Used to compute similarity to concepts in ontology A 

sim(melon_A) = not_found with sv = 0. This is another example of case (d), where
words sent to B from A do not match a parent-concept pair in B.



2.2 Conclusions 

Methods embodied in a computer program are given to allow concept exchange and 
understanding between agents with different ontologies, so that there is no need to 
agree first on a standard set of concept definitions. Given a concept, a procedure for 
finding the most similar concept in another ontology is shown. The procedure also 
finds a measure of the similarity sv between concepts c_A and c_B. Our methods need
further testing against large, vastly different, or practical ontologies. 

In contrast, most efforts to make two agents communicate take one of these approaches:

1. The same person or programmer writes (generates) both agents, so that pre- 
established ad hoc communicating sequences (“calling sequences,” with 
predefined order of arguments and their meaning) are possible. This approach, of 
course, will fail if an agent is trying to communicate with agents built by 
somebody else. 

2. Agents use a common or “standard” ontology to exchange information. This is the 
approach taken by CYC [11]. Standard ontologies are difficult and slow to build 
(they have to be designed by committee, most likely). Another deficiency: since 
new concepts appear each day, they slowly trickle to the standard ontology, so that 
it always stays behind current knowledge. 

Even for approach (2), a language is needed to convey other entities built out of
concepts, not just the concepts themselves: complex objects (which do not have a name),
actions, desires, plans, algorithms, and so on. Such a language is beyond the scope of
this paper; hints of it appear in [13].

Our approach allows communication in spite of different ontologies, and needs 
neither (1) nor (2). 
Abstract. As XML becomes widely accepted as a means of storing,
searching and extracting information, a larger number of Web
applications will require conceptual models and administrative tools to
organize their collections of documents. Recently, event-condition-action
(ECA) rules have been proposed to provide reactive functionality in
XML document databases. However, logical inference mechanisms to
deliver multiagent-based applications remain unconsidered in those
models. In this paper, we introduce ADM, an active deductive XML
database model that extends XML with logical variables, logical
procedures and ECA rules. ADM has been partially implemented in an
open distributed coordination architecture written in Java. Besides
coupling the rational and reactive behavioral aspects into a simple and
uniform model, a major contribution of this work is the introduction
of sequential and parallel rule composition as an effective strategy to
address the problem of scheduling rule selection and execution.

Keywords: XML, Semantic Web, Deductive Databases, Active 
Databases. 



1 Introduction 

The motivation behind this work is the increasing use of the Extensible Markup
Language, XML, as a means of storing structured information that cannot be
conveniently stored using the available database technology. Examples of handling
XML document collections are recent Web-based recommendation systems that
have increasingly adopted the publish/subscribe model to disseminate relevant
new information among a community of users. In the active database
community it has been proposed to extend the persistent storage functionality
of databases to support rule production systems [2]. By extending the
relational database model with expert systems technology, a number of
application-independent tasks can be undertaken by the Active Database Management
System (active DBMS), e.g. enforcement of data integrity constraints and data
protection, versioning maintenance and data monitoring for knowledge discovery
and acquisition. Recently, XML active DBMS have been suggested to mirror
major results of active DBMS to the XML document management domain [1].
Another substantial amount of research has been devoted to deductive database
systems [6]. Those systems are similar to relational database systems in
that both are passive, responding only to user queries. Deductive DBMS extend a
relational DBMS with a Prolog-like inference engine to answer possibly recursive
queries in order to derive conclusions logically entailed in the database content.
As the expressive power of both the query model and the language becomes
richer, the deductive DBMS becomes a more convenient means to describe the
complex conditions used in production rules.

The DARPA Agent Markup Language, DAML [4], has been developed as an 
extension of XML by using ontologies to describe objects and their relations to 
other objects. We follow, instead, a theorem prover approach. 

In this work, we introduce ADM to address the problem of coupling XML
active databases with deductive databases to support multiagent-based
applications. The adopted approach consists in extending XML with logical variables,
logical procedures and ECA rules. The extensions allow a user to annotate XML
documents with logical constraints, among other related features, to enforce
integrity. Event annotations allow a user to describe where the insertion or
deletion events take place, making available the content of the document involved
to test the integrity constraints. The experimental system developed so far can
be considered an open distributed coordination architecture that enables
third-party applications to use the already created XML documents. It provides basic
language constructs to coordinate document content flow in Web-based
applications. ADM is built upon LogCIN-XML, an experimental deductive XML database
developed by the first two authors of the current paper, which in turn has been
written in Prolog-Cafe [3], a WAM-like Prolog engine that sits on the Java
Virtual Machine. While Prolog-Cafe provides the basic automated inferencing
capabilities, a set of Java servlets provides the core functionality for coordination
and communication among clients and servers.



2 Extensional Databases 

XML extensional databases are collections of standard XML documents main- 
tained on a distributed directory structure across the Internet. ADM preserves the 
open character of extensional databases by keeping apart language extensions 
from the data created by third-party applications. From a logic programming 
point of view, XML elements roughly correspond to ground terms. Nonetheless, 
XML elements have a more complex structure than those of first order logical 
languages due to the presence of a list of attribute-value pairs. Assuming that 
such a list is in ascending order by the name of the attribute, the structure of an 
XML element can be rewritten into a syntactically equivalent Prolog term. As 
an example, Table 1 shows a collection of XML documents containing family relations
information in which all documents share the same simple structure. The
root element fatherof contains two child elements father and son describing 
the role of the participants. Thus, bob and tom are two persons related by the 
fatherof relation, having the roles father and son respectively. 
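
As a rough illustration of this rewriting, the sketch below turns an XML element into a Prolog-style term string with its attribute list sorted by name. The target syntax tag([attributes], [children]) and the function name to_prolog_term are assumptions made for the example; the text does not specify the concrete term encoding used by ADM. Text that follows a closing child tag is ignored for brevity.

```python
import xml.etree.ElementTree as ET


def to_prolog_term(elem):
    """Render an element as tag([name='value', ...], [children...])."""
    # Sorting by attribute name gives every element a canonical attribute list.
    attrs = ", ".join(
        f"{name}='{value}'" for name, value in sorted(elem.attrib.items())
    )
    children = [to_prolog_term(child) for child in elem]
    text = (elem.text or "").strip()
    if text:
        # Leading text content becomes a quoted atom before the child terms.
        children.insert(0, f"'{text}'")
    return f"{elem.tag}([{attrs}], [{', '.join(children)}])"


doc = ET.fromstring("<fatherof><father>bob</father><son>tom</son></fatherof>")
print(to_prolog_term(doc))
```

With the attributes in a fixed order, two elements that differ only in the order in which their attributes were written map to the same term, which is what makes syntactic comparison and unification straightforward.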
Table 1. A family relations extensional XML database.

<fatherof><father>bob</father><son>joe</son></fatherof>
<fatherof><father>bob</father><son>tom</son></fatherof>
<fatherof><father>joe</father><son>doe</son></fatherof>



3 Intensional Databases

Intensional databases are collections of logical procedures and ECA rules called
programs. The reason for keeping language extensions isolated from data is
twofold: firstly, to create a low-level conceptual layer of the fundamental concepts
that underlie the application, and secondly, to create an upper-level conceptual
layer to describe the integrity constraints and consistency-preserving rules that
can be logically derived from the extensional database contents.

XML elements are extended with logical variables to introduce the notion 
of XML terms. Logical variables are used to define the integrity constraints 
imposed on the document contents and to formulate queries to bind values to 
variables. Logical variables can be of either type string (prefixed by a '$') or
term (prefixed by a '#') and they may occur in elements and in attributes. The
function var : XMLTerm → P(XMLVariable) is defined to obtain the set of variables
occurring in a term (here P stands for the power set operator).
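
A minimal sketch of such a var function is given below, assuming that variable names consist of the prefix followed by letters, digits or underscores (the text does not state the lexical rules) and that an XML term is available as a parsed element. The names VARIABLE and variables are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Assumed lexical form of a logical variable: '$' or '#' followed by an identifier.
VARIABLE = re.compile(r"[$#][A-Za-z_][A-Za-z0-9_]*")


def variables(elem):
    """Collect the logical variables in attribute values and text content."""
    found = set()
    for node in elem.iter():
        for text in (node.text, node.tail, *node.attrib.values()):
            if text:
                found.update(VARIABLE.findall(text))
    return found


term = ET.fromstring(
    '<brotherof subject="$X" object="$Y">'
    "<fatherof><father>$Z</father><son>$X</son></fatherof>"
    "</brotherof>"
)
print(sorted(variables(term)))
```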

Integrity constraints are introduced by means of logical procedures. A logical
procedure has the form <b b_1 = T_1 ... b_n = T_n> B_1 ... B_m </b>, comprising
a head and a body. The head <b b_1 = T_1 ... b_n = T_n> consists of a procedure
name b and a list of parameter names b_1 ... b_n associated with their respective
terms T_1 ... T_n. The body B_1 ... B_m consists of a sequence of procedure calls.
A procedure call has either the form <b b_1 = S_1 ... b_n = S_n /> or the form
<b b_1 = S_1 ... b_n = S_n> A_1 ... A_k </b>, where S_1 ... S_n are the XML terms passed
to the procedure and A_1, ..., A_k are XML terms. The former case corresponds to
a call to a defined procedure, whereas the latter case corresponds to a query on
the extensional database. In either case, unification is used for parameter passing
and for equality comparison between documents with complex structure.

The consistency-preserving and reactive behavior is described by ECA
rules. Any ECA rule has the structure given in Table 2. When a collection
of XML documents changes by inserting or deleting a document, an event E
occurs, and then one or more rules may be selected. In this case, condition C is
checked and, if it is satisfied within the database, the actions of deleting
document B or inserting document A are executed. Ordering of actions can be
defined by composing rules R either sequentially or in parallel.

As an example of an intensional database, Table 3 shows the logical procedures
brotherof and cousinof and the ECA rule partyInvitation. The logical
procedures describe the brother-of and cousin-of relations between two persons,
called in the procedure the subject and the object, respectively. In the logical
Table 2. ECA rule structure.

<rule> <on> E </on> <if> C </if>
  <do>
    <delete> B </delete>    (optional)
    <insert> A </insert>    (optional)
    <seq> R </seq>          (optional)
    <par> R </par>          (optional)
  </do>
</rule>



procedure brotherof, logical variables $X and $Y are respectively associated with
parameter names subject and object. The call to the system-defined procedure
notequal tests whether the ground terms bound to variables $X and $Y are not
identical. Table 3 also shows the ECA rule partyInvitation, which announces the
good news of a new family member to all of that member's cousins.

Table 3. A family relations document in the XML intensional database.

<brotherof subject="$X" object="$Y">
  <notequal subject="$X" object="$Y" />
  <fatherof><father>$Z</father><son>$X</son></fatherof>
  <fatherof><father>$Z</father><son>$Y</son></fatherof>
</brotherof>

<cousinof subject="$X" object="$Y">
  <fatherof><father>$Z1</father><son>$X</son></fatherof>
  <fatherof><father>$Z2</father><son>$Y</son></fatherof>
  <brotherof subject="$Z1" object="$Z2" />
</cousinof>

<rule name="partyInvitation">
  <on> <insert>
    <fatherof><father>$X</father><son>$Y</son></fatherof>
  </insert> </on>
  <if> <cousinof subject="$Y" object="$Z" /> </if>
  <do> <insert>
    <invitation><host>$Y</host><guest>$Z</guest></invitation>
  </insert> </do>
</rule>



3.1 Rule Activation 

A rule is activated when a new document is inserted. Table 4 shows the new 
document that has the structure defined by the rule’s event section. 

An alerter agent notifies the system that an insertion event occurs. The 
alerter sends the document content to the ADM rule manager to select an appro- 
priate rule. 
Table 4. Inserted XML document.

<fatherof><father>tom</father><son>tim</son></fatherof>



3.2 Rule Response 

After ADM receives the document involved in the event from the alerter, it selects 
a rule and checks the condition. If the condition holds, the rule is executed.



Table 5. Document generated by rule partyinvitation. 



<invitation><host>tim</host><guest>doe</guest></invitation>



In the rule, the event section uses a document template to retrieve the required
information. The logical variables $X and $Y are used to extract the
pieces of information that match their position in the event document. A
substitution that includes the bindings $X = "tom" and $Y = "tim" is applied to
the rule to obtain an instance of the rule. Then, a solution $Z = "doe" of the
query <cousinof subject="$Y" object="$Z" /> is deduced. As the query
succeeds, the action of inserting the document shown in Table 5 is performed.
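
The walk-through above can be replayed with a small Python sketch. It is not the ADM implementation: the fatherof documents of Table 1 are modelled as (father, son) pairs, the brotherof and cousinof procedures of Table 3 are hand-translated into set comprehensions instead of being resolved by an inference engine, and the function names are hypothetical.

```python
# The extensional database of Table 1 as (father, son) pairs.
FACTS = {("bob", "joe"), ("bob", "tom"), ("joe", "doe")}


def brothers(person, facts):
    """brotherof: same father, and the two persons are not identical."""
    return {
        other
        for (father, son) in facts if son == person
        for (father2, other) in facts if father2 == father and other != person
    }


def cousins(person, facts):
    """cousinof: the fathers of the two persons are brothers."""
    result = set()
    for (father, son) in facts:
        if son == person:
            for uncle in brothers(father, facts):
                result |= {child for (f, child) in facts if f == uncle}
    return result


def on_insert(father, son, facts):
    """partyInvitation: on insert of fatherof, invite every cousin of the son."""
    facts = facts | {(father, son)}  # the event document is part of the database
    invitations = [("invitation", son, guest) for guest in sorted(cousins(son, facts))]
    return facts, invitations


new_facts, invitations = on_insert("tom", "tim", FACTS)
print(invitations)
```

Inserting the document of Table 4 binds the host to tim; the only cousin found is doe, so a single invitation corresponding to Table 5 is produced.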

4 Operational Semantics 

4.1 Basic Definitions 

In this section a few basic definitions for XML terms are given to formally
introduce the operational semantics of ADM. The XML term <a a_1 = T_1 ... a_m =
T_m> ... </a> is normalized if its list of attribute-value pairs is lexicographically
ordered by the name of the attribute in increasing order: a_i < a_j if i < j,
i, j ∈ {1, ..., m}. The set of substitutions Σ consists of the partial functions
XMLVariable → XMLTerm defined recursively in Table 6. Computed answers to
queries are obtained by the composition of substitutions. The null substitution ε
is the identity for substitution composition: εσ = σε = σ. A ground substitution
does not introduce any variable. The natural extension σ : XMLTerm → XMLTerm
of a substitution defined over variables to XML terms is denoted by the same
name. The instance of an XML term x under substitution σ is denoted by xσ
and its inductive definition on the structure of the XML term is given in Table 6.
Instances of XML terms under ground substitutions are called ground instances.
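
The following sketch illustrates substitution application and composition on a simplified term representation. The representation is an assumption made for the example: a variable is a string starting with '$' or '#', any other string is text, and an element is a tuple (tag, attributes, children) whose attributes are (name, term) pairs already sorted by name. The function names is_var, apply and compose are hypothetical.

```python
def is_var(term):
    """A variable is a string prefixed by '$' (string) or '#' (term)."""
    return isinstance(term, str) and term[:1] in ("$", "#")


def apply(term, sigma):
    """Compute the instance of a term under the substitution sigma."""
    if is_var(term):
        return sigma.get(term, term)  # unbound variables stay as they are
    if isinstance(term, str):
        return term                   # plain text is left unchanged
    tag, attrs, children = term
    return (
        tag,
        tuple((name, apply(value, sigma)) for name, value in attrs),
        tuple(apply(child, sigma) for child in children),
    )


def compose(sigma, theta):
    """Composition: applying the result equals applying sigma, then theta."""
    result = {var: apply(term, theta) for var, term in sigma.items()}
    for var, term in theta.items():
        result.setdefault(var, term)
    return result


template = ("invitation", (), (("host", (), ("$Y",)), ("guest", (), ("$Z",))))
answer = {"$Y": "tim", "$Z": "doe"}
print(apply(template, answer))
```

An empty dictionary plays the role of the null substitution ε, so composing with it on either side leaves a substitution unchanged.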

A unifier σ for the XML terms T and S is a substitution that produces
syntactically identical instances Tσ = Sσ. The most general unifier (mgu) is a
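Table 6. Variable substitution on XML terms.

\[
\begin{aligned}
x\sigma &= x, \quad x \in \mathit{XMLString} \cup \mathit{XMLText} \\
x\sigma &= \sigma(x), \quad x \in \mathit{XMLVariable} \\
\langle a\ a_1 = T_1 \cdots a_m = T_m\,/\rangle\,\sigma &= \langle a\ a_1 = T_1\sigma \cdots a_m = T_m\sigma\,/\rangle \\
\bigl(\langle a\ a_1 = T_1 \cdots a_m = T_m\rangle \cdots T \cdots \langle /a\rangle\bigr)\,\sigma &= \langle a\ a_1 = T_1\sigma \cdots a_m = T_m\sigma\rangle \cdots T\sigma \cdots \langle /a\rangle
\end{aligned}
\]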



unifier that cannot be obtained as the composition of any other. The unifica- 
tion algorithm, shown in Table 7, essentially corresponds to that introduced by 
Martelli-Montanari [7], producing either a most general unifier if it exists or a 
failure otherwise. 



Table 7. Unification of XML terms. 





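\[
\frac{A = \langle a\ a_1 = S_1 \cdots a_m = S_m\rangle E_1 \cdots E_n \langle /a\rangle \qquad B = \langle a\ a_1 = T_1 \cdots a_m = T_m\rangle F_1 \cdots F_n \langle /a\rangle}
     {(C \cup \{A = B\}, \sigma) \triangleright (C \cup \{S_1 = T_1, \ldots, S_m = T_m, E_1 = F_1, \ldots, E_n = F_n\}, \sigma)}
\]

\[
\frac{A = \langle a\ a_1 = S_1 \cdots a_m = S_m\,/\rangle \qquad B = \langle a\ a_1 = T_1 \cdots a_m = T_m\,/\rangle}
     {(C \cup \{A = B\}, \sigma) \triangleright (C \cup \{S_1 = T_1, \ldots, S_m = T_m\}, \sigma)}
\]

\[
(C \cup \{T = T\}, \sigma) \triangleright (C, \sigma)
\]

\[
(C \cup \{T = x\}, \sigma) \triangleright (C \cup \{x = T\}, \sigma)
\]

\[
(C \cup \{x = T\}, \sigma) \triangleright (C\sigma, \sigma\{x \mapsto T\}) \qquad x \notin vars(T)
\]

\[
(C \cup \{x = T\}, \sigma) \triangleright \mathit{failure} \qquad x \in vars(T)
\]

\[
\frac{(\{T = S\}, \epsilon) \triangleright^{*} (\emptyset, \sigma)}{mgu(T, S) = \sigma}
\]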

The most general unifier is defined by means of the reflexive and transitive closure ▷* of
the binary relation ▷ ⊆ (XMLTerm × Σ) × (XMLTerm × Σ). Unification of XML
terms provides a uniform mechanism for parameter passing, construction and
selection of the information contained in the XML document.
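
A compact sketch of unification in the style of the rules of Table 7 follows. It is an illustration under assumptions, not the authors' implementation: terms use a simplified tuple representation (a variable is a string starting with '$' or '#', any other string is text, an element is (tag, attributes, children) with attributes sorted by name), the constraint set is a work list, and failure is signalled by returning None. All function names are hypothetical.

```python
def is_var(term):
    return isinstance(term, str) and term[:1] in ("$", "#")


def apply(term, sigma):
    """Instance of a term under a substitution (see Table 6)."""
    if is_var(term):
        return sigma.get(term, term)
    if isinstance(term, str):
        return term
    tag, attrs, children = term
    return (tag,
            tuple((name, apply(value, sigma)) for name, value in attrs),
            tuple(apply(child, sigma) for child in children))


def occurs(var, term):
    """Occurs check: does the variable appear anywhere inside the term?"""
    if isinstance(term, str):
        return var == term
    _, attrs, children = term
    return (any(occurs(var, value) for _, value in attrs)
            or any(occurs(var, child) for child in children))


def mgu(t, s):
    """Return a most general unifier of t and s as a dict, or None on failure."""
    constraints, sigma = [(t, s)], {}
    while constraints:
        a, b = constraints.pop()
        a, b = apply(a, sigma), apply(b, sigma)
        if a == b:                          # trivial equation: drop it
            continue
        if not is_var(a) and is_var(b):     # orient: variable on the left
            a, b = b, a
        if is_var(a):
            if occurs(a, b):                # x = T with x in vars(T): failure
                return None
            sigma = {x: apply(v, {a: b}) for x, v in sigma.items()}
            sigma[a] = b                    # eliminate the variable
            continue
        if isinstance(a, str) or isinstance(b, str):
            return None                     # clash between text and other term
        (tag_a, attrs_a, kids_a), (tag_b, attrs_b, kids_b) = a, b
        if (tag_a != tag_b or len(kids_a) != len(kids_b)
                or [n for n, _ in attrs_a] != [n for n, _ in attrs_b]):
            return None                     # different element shapes
        # Decompose: equate attribute values and children pairwise.
        constraints.extend(zip((v for _, v in attrs_a), (v for _, v in attrs_b)))
        constraints.extend(zip(kids_a, kids_b))
    return sigma


template = ("fatherof", (), (("father", (), ("$X",)), ("son", (), ("$Y",))))
event = ("fatherof", (), (("father", (), ("tom",)), ("son", (), ("tim",))))
print(mgu(template, event))
```

Unifying the event template of the partyInvitation rule with the inserted document of Table 4 yields the bindings used in Section 3.2.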



4.2 Deductive Database Model 

The meaning of a logical procedure is given by the interpretation of an
SLD-resolution¹ step of a procedure call [5]. Conversely, the declarative reading of
a logical procedure establishes the validity of the call to the procedure head
whenever all the conditions forming the procedure body are valid. An SLD
resolution step of the logical procedure <b b_1 = T_1 ... b_n = T_n> B_1 ... B_m </b> in the
query <query> A_1 ... A_n </query> is obtained by replacing the first call A_1 in the
query by the instance B_1σ' ... B_mσ' of the body under the most general unifier σ'
of A_1 and the head <b b_1 = T_1 ... b_n = T_n /> of the procedure.

¹ Linear resolution for Definite programs with Selection rule

Table 8. SLD inference relation. 

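\[
\frac{\begin{array}{c}
\langle b\ b_1 = T_1 \cdots b_n = T_n\rangle B_1 \cdots B_m \langle /b\rangle \in \mathit{XMLProgram} \\
\exists\,\sigma'.\ \sigma' = mgu(head(A_1), \langle b\ b_1 = T_1 \cdots b_n = T_n\rangle)
\end{array}}
{(\langle query\rangle A_1 A_2 \cdots A_n \langle /query\rangle, \sigma) \Rightarrow (\langle query\rangle B_1\sigma' \cdots B_m\sigma' A_2\sigma' \cdots A_n\sigma' \langle /query\rangle, \sigma\sigma')}
\]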



An SLD resolution step is thus defined as the relation ⇒ ⊆ (XMLTerm × Σ) ×
(XMLTerm × Σ) that transforms the query (<query> A_1 A_2 ... A_n </query>, σ) into
the query (<query> B_1σ' ... B_mσ' A_2σ' ... A_nσ' </query>, σσ') by replacing the
procedure call A_1σ' by the instance of the procedure body B_1σ' ... B_mσ' under σ'.



Table 9. Correctness of SLD inference. 

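\[
\frac{(\langle query\rangle C_1 \cdots C_k \langle /query\rangle, \epsilon) \Rightarrow^{*} (\langle query\rangle \langle /query\rangle, \sigma)}
     {P \models_{\sigma} \langle query\rangle C_1 \cdots C_k \langle /query\rangle}
\]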



The computed answer of a query is obtained through zero, one or more SLD resolution
steps beginning from the initial query. The query execution terminates when
there are no more goals to be solved. The computed answer is obtained from the
composition of the substitutions used in each resolution step. Table 9 states that
the computed answer is indeed the correct answer, i.e. the substitution that is
the solution of the query. A well known result from logic programming [5] asserts
that a computed answer σ obtained by the SLD resolution calculus is a model
for both the program P and the query <query> C_1 ... C_k </query>. Therefore, the
computed answers are those that satisfy the logical constraints of the program.

4.3 Active Database Model 

ECA rules integrate event-directed rule processing into the knowledge-based
model of deductive databases. The ECA rule model closely follows the
perception-deduction-action cycle of multiagent-based systems.

The operational semantics of rule execution, given in Table 10, defines the
reduction relation → ⊆ (MXMLTerm × Σ) × (MXMLTerm × Σ ∪ MXMLTerm),
where MXMLTerm denotes multisets of XML terms, including logical procedures
and ECA rules. After observing event E, if there is a computed answer
σ' for both the extended program with the event P ∪ {E} and the condition
<query> C_1 ... C_k </query>, then the collections of documents in both the
extensional and intensional databases are modified by deleting the instances of
documents B_1σ', ..., B_mσ' and inserting the instances of documents A_1σ', ..., A_nσ'
under σ'. The difference ⊖ and union ⊕ operators for multisets of documents are
used instead of the corresponding operators over sets due to the importance of
document multiplicity. The order in which documents are deleted or inserted is
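Table 10. Reduction relation of active rules.

\[
\frac{\begin{array}{c}
\langle rule\rangle \langle on\rangle E \langle /on\rangle \langle if\rangle C_1 \cdots C_k \langle /if\rangle \langle do\rangle \langle delete\rangle B_1 \cdots B_m \langle /delete\rangle \langle insert\rangle A_1 \cdots A_n \langle /insert\rangle \langle /do\rangle \langle /rule\rangle \in P \\
\exists\,\sigma' \in \Sigma.\ P \cup \{E\} \models_{\sigma'} \langle query\rangle C_1 \cdots C_k \langle /query\rangle
\end{array}}
{(P, \sigma) \to (P \ominus \{B_1\sigma', \ldots, B_m\sigma'\} \oplus \{A_1\sigma', \ldots, A_n\sigma'\}, \sigma\sigma')}
\]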



not specified. By using instances of documents under substitutions, the semantics
captures the propagation of values among logical variables across documents.
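
The role of the multiset operators in this reduction step can be illustrated with Python's collections.Counter, used here as a stand-in for a multiset of documents. Documents are represented as plain strings and the function name fire is hypothetical; condition checking and substitution are left out, so only the database update P ⊖ {...} ⊕ {...} is shown.

```python
from collections import Counter


def fire(database, deleted, inserted):
    """Multiset update: remove one copy per deleted document, then add the inserted ones."""
    # Counter subtraction drops counts that fall to zero or below,
    # which matches multiset difference.
    return database - Counter(deleted) + Counter(inserted)


# Two identical copies of one document plus one other document.
db = Counter({"<order>pen</order>": 2, "<order>ink</order>": 1})
after = fire(db, ["<order>pen</order>"], ["<invoice>pen</invoice>"])
print(after)
```

With ordinary sets the two identical order documents would collapse into one and deleting it would remove the order entirely; with multisets one copy survives the deletion.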



Table 11. Termination condition. 



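\[
\frac{\begin{array}{c}
\langle rule\rangle \langle on\rangle E \langle /on\rangle \langle if\rangle C_1 \cdots C_k \langle /if\rangle \langle do\rangle \cdots \langle /do\rangle \langle /rule\rangle \in P \\
\neg\exists\,\sigma' \in \Sigma.\ P \cup \{E\} \models_{\sigma'} \langle query\rangle C_1 \cdots C_k \langle /query\rangle
\end{array}}
{(P, \sigma) \to P}
\]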



The operational semantics for the termination condition of the execution of
a rule is given in Table 11. After observing the event E that triggers a rule, the
rule does not execute if the current contents of the XML databases do not
entail the rule condition. In this case, the reduction relation → leads to the
program singleton P.

Though ECA rules are powerful, unpredictable behavior may arise since
the rules may interfere with each other, preventing their execution. In order
to reduce the non-determinism in the ordering of actions, the sequential and
parallel composition of rules are introduced to schedule their execution. The
operational semantics of sequential composition is given in Table 12.

The first rule describes the termination condition of two rules under sequential
composition. If rule P terminates at state σ, then the sequential composition of
rules P and Q behaves like Q at the same state σ. The second rule describes the
progress condition of two rules under sequential composition. If at state σ rule
P reduces to rule P', then the sequential composition of P and Q reduces to the
sequential composition of P' and Q.
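Table 12. Reduction relation for the sequential composition of rules.

\[
\frac{(P, \sigma) \to P}
     {(\langle seq\rangle P\ Q \langle /seq\rangle, \sigma) \to (Q, \sigma)}
\]

\[
\frac{(P, \sigma) \to (P', \sigma')}
     {(\langle seq\rangle P\ Q \langle /seq\rangle, \sigma) \to (\langle seq\rangle P'\ Q \langle /seq\rangle, \sigma\sigma')}
\]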

Table 13 shows the rules of parallel composition of programs. The function
docset : MXMLTerm × Σ → MXMLTerm gets the multiset consisting of document
instances (possibly including logical variables) under a given substitution. The
first two rules apply only if P and Q share a non-empty multiset of documents. In
that case, either rule P or Q may develop its behavior, but not both simultaneously.



Table 13. Reduction relation for parallel composition of rules. 

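\[
\frac{(P, \sigma) \to (P', \sigma') \qquad docset(P, \sigma) \cap docset(Q, \sigma') \neq \emptyset}
     {(\langle par\rangle P\ Q \langle /par\rangle, \sigma) \to (\langle par\rangle P'\ Q \langle /par\rangle, \sigma\sigma')}
\]

\[
\frac{(Q, \sigma) \to (Q', \sigma') \qquad docset(P, \sigma) \cap docset(Q, \sigma') \neq \emptyset}
     {(\langle par\rangle P\ Q \langle /par\rangle, \sigma) \to (\langle par\rangle P\ Q' \langle /par\rangle, \sigma\sigma')}
\]

\[
\frac{(P, \sigma) \to (P', \sigma_1) \qquad (Q, \sigma) \to (Q', \sigma_2) \qquad docset(P, \sigma) \cap docset(Q, \sigma) = \emptyset}
     {(\langle par\rangle P\ Q \langle /par\rangle, \sigma) \to (\langle par\rangle P'\ Q' \langle /par\rangle, \sigma\sigma_1\sigma_2)}
\]

\[
\frac{(P, \sigma) \to P \qquad (Q, \sigma) \to Q}
     {(\langle par\rangle P\ Q \langle /par\rangle, \sigma) \to P \cup Q}
\]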



The third rule for parallel composition applies only if rules P and Q do not share
a collection of documents. In that case, if both rules independently exhibit some
progress, then under parallel composition they do not interfere. Finally, the last rule
describes the termination condition under parallel composition of rules. If two
programs P and Q terminate at state σ, then they also terminate under parallel
composition at the same state.

5 Conclusions 

A simple and uniform model for active and deductive XML databases has been 
proposed in this paper. The ADM language extends the XML language by
introducing logical variables, logical procedures and ECA rules. An experimental
distributed system with layered architecture has been implemented that main- 
tains the openness of the XML data by keeping apart the language extensions. 
As future work, we plan to develop analysis methods and tools to face the most 
difficult aspects of controlling and predicting rule execution. 
Abstract. Nowadays, organizations must continually adapt to market and
organizational changes to achieve their most important goals. The migration to 
business services and service-oriented architectures provides a valuable 
opportunity to attain the organization objectives. The migration causes 
evolution both in organizational structure and technology enabling businesses to 
dynamically change vendors or services. The paper proposes a view integrating 
the concept of networked organization & Web intelligence & Web Services into 
a collaboration environment of a networked organization. An approach to 
knowledge logistics problem based on the concepts of Web intelligence and 
Web services in the networked intelligent organization environment is 
described. Applicability of the approach is illustrated through a "Binni
scenario"-based case study of portable hospital configuration as an e-business &
e-government coalition operation.



1 Introduction 

Nowadays, organizations must continually adapt to market and organizational 
changes to achieve their most important goals: lowering costs, expanding revenues, 
and retaining customers. The migration to business services and service-oriented 
architectures provides a valuable opportunity to attain the organization objectives [1]. 
The migration causes evolution both in organizational structure and technology 
enabling businesses to dynamically change vendors or services. 

Among the forms of organizational structure, the networked organization has
emerged. This form denotes an organization that uses information and
communication technologies to extend its boundaries and physical location [2]. Such
an organization can be considered as an intelligent organization with a distributed
network structure. The nodes of a networked organization represent any objects of the 
environment (people, teams, organizations, etc.) acting independently and forming 
multiple links across boundaries to work together for a common purpose [3]. 

Behaviour of a networked organization corresponds to the behaviour of an 
intelligent organization. The latter “behaves as an open system which takes in 
information, material and energy from the environment, transforms these resources 
into knowledge, processes, and structures that produce goods or services which are in 
turn consumed by the environment” [4]. Such behaviour assumes the presence of
organizational ‘intellectual abilities’, i.e., abilities to create, exchange, process and infer
knowledge, and to learn.

Such technologies as Web intelligence (e.g., business intelligence) provide
strategies for implementing the main behavioural principles of a networked intelligent
organization. The purpose of business intelligence is effective support of consumer
and business processes. Attaining this goal involves developing services
for consumer needs recognition, information search, and evaluation of alternatives
[5]. Web intelligence deals with the advancement of Web-empowered systems, services,
and environments. It includes issues of Web-based knowledge processing and 
management; distributed inference; information exchange and knowledge sharing [6]. 

The approach described in this paper has been developed to solve the knowledge logistics
problem. Knowledge logistics [7] addresses activities around knowledge delivery.
These activities comprise acquisition, integration, and transfer of the right knowledge
from distributed sources located in an information environment and its delivery in the
right context, to the right person, at the right time, for the right purpose. The aim of
knowledge logistics has much in common with the purposes of business intelligence
and Web intelligence. The proposed approach combines artificial intelligence
technologies that appear in business intelligence and Web intelligence, such as intelligent
agents, profiling, ontology and knowledge management, with constraint satisfaction.

In the context of this paper, the approach offers Web-based intelligent services for a
networked organization, using support of a coalition operation as an example. The choice of
coalition operations is governed by the structure and activities of the cooperation
supporting such operations. The cooperation is made up of a number of different,
quasi-volunteered, vaguely organized groups of people, non-governmental
organizations, and institutions providing humanitarian aid. Its activities are oriented
to the support of e-health / humanitarian operations. The cooperation structure is close to
that of a networked organization. The cooperation is formed for problem solving by
all cooperation members. The task of forming the cooperation and the problem
solving itself can be considered as the above-cited ‘common purpose’.

The rest of the paper is organized as follows. Section 2 describes a framework of
the proposed approach. The main implementation features of the approach that are
significant to networked organization and cooperative problem solving are presented in
Section 3. Application of the approach is illustrated in Section 4 through a case study.



2 KSNet-Approach: Framework 

The approach described here considers the knowledge logistics problem as a
problem of configuring a Knowledge Source Network (KSNet); for this reason it
is referred to as the KSNet-approach [8]. Knowledge sources (KSs) comprise
end-users / customers, loosely coupled knowledge sources / resources, and a set of tools
and methods for information / knowledge processing. In the context of this paper, the
configured KSNet is thought of as a networked organization, where the constituents
of the KSNet listed above correspond to the organization nodes.

Since knowledge logistics assumes dealing with knowledge contained in
distributed and heterogeneous KSs, the approach is oriented to an ontological model
providing a common way of knowledge representation. The methodology addresses
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Fig. 1. Framework of ontology-driven methodology 



user needs (problems) identification and solving these problems (Fig. 1). User needs 
are introduced by a request. The methodology considers request processing as a 
configuration of a network of KSs containing information relevant to the request, 
generation of an appropriate solution relying on this information, and presenting the 
solution to the user represented by a Web-service client (service requestor). 

At the heart of the framework lies a fundamental ontology providing a common 
notation. It is implemented through an ontology library, a knowledge 
storage that assigns a common notation and provides a common vocabulary to the 
ontologies it stores. The main components of the ontology library are domain, tasks & 
methods, and application ontologies. All these ontologies are interrelated according to 
[9] in such a way that an application ontology (AO) is a specialization of both domain 
and tasks & methods ontologies. 

AO plays a central role in the request processing. It represents shared knowledge of 
a user (request constituent in Fig. 1) and knowledge sources (knowledge source 
constituent in Fig. 1). AO is formed through merging parts of domain and 
tasks & methods ontologies relevant to the request into a single ontology. Requested 
information from KSs is associated with the same AO that is formed for the request 
processing [7]. 

Request ontologies and knowledge source ontologies are used for translation between 
user terms, KS terms, and the vocabulary supported by the ontology library. 
These ontologies represent correspondences between the terms of the AO, which are words of 
the ontology library vocabulary, and request / knowledge source terms. 

The KSNet-approach is based on the idea that knowledge corresponding to 
individual user requirements and knowledge peculiar to KSs are represented by 
restrictions on the shared knowledge described by the AO. That was the main reason to 
use an approach oriented to constraint satisfaction / propagation technology for 
problem solving. The formalism of object-oriented constraint networks has been 
chosen as the common notation [7]. According to this formalism, an ontology is described by 
sets of classes, attributes of the classes, domains of the attributes, and constraints. 
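
As a rough illustration of this formalism, an ontology can be held as classes, their attributes, attribute domains, and constraints, and solved by enumerating assignments. The sketch below is written in Python rather than with the ILOG tool used by the system; the class `Ontology`, the attribute names, and the capacity constraint are hypothetical and are not taken from the KSNet implementation.

```python
from dataclasses import dataclass, field
from itertools import product
from typing import Callable, Dict, List, Tuple

Assignment = Dict[Tuple[str, str], object]  # (class, attribute) -> value


@dataclass
class Ontology:
    """Toy object-oriented constraint network (illustrative only)."""
    # class name -> attribute name -> finite domain of admissible values
    classes: Dict[str, Dict[str, List[object]]] = field(default_factory=dict)
    # each constraint is a predicate over a complete assignment
    constraints: List[Callable[[Assignment], bool]] = field(default_factory=list)

    def solutions(self) -> List[Assignment]:
        """Enumerate all assignments that satisfy every constraint."""
        keys = [(c, a) for c, attrs in self.classes.items() for a in attrs]
        domains = [self.classes[c][a] for c, a in keys]
        found = []
        for values in product(*domains):
            assignment = dict(zip(keys, values))
            if all(check(assignment) for check in self.constraints):
                found.append(assignment)
        return found


# Hypothetical application ontology fragment: a hospital and one supplier.
ao = Ontology(
    classes={
        "Hospital": {"capacity": [50, 100]},
        "Supplier": {"beds": [40, 80, 120]},
    },
    # A user requirement expressed as a restriction on the shared knowledge.
    constraints=[lambda s: s[("Supplier", "beds")] >= s[("Hospital", "capacity")]],
)
feasible = ao.solutions()
```

Exhaustive enumeration is used here only for brevity; a constraint-based tool would rely on propagation to prune the domains.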

Because of the distributed structure of a networked organization, its behaviour as an open 
system, and its orientation toward the Internet as the e-business environment, the technology 
of Web services has been applied for the approach implementation. ILOG has been chosen 
as the constraint-based tool [10]. The system implementing the approach 
inherits its name and is referred to as the system “KSNet”. A detailed description of 
the multiagent system architecture and its functionalities can be found in [11]. The main 
features of the system that are significant for a networked intelligent organization and 
cooperative problem solving are described in the next section. 

Fig. 2. Service-based scenario for the system "KSNet" 



3 Service Model of the System “KSNet” 

The general aim of the move toward Web services is to turn the Web into a collection of 
computational resources, each with a well-defined interface for invoking its services. 
Web services can thus be defined as software objects that can be assembled over 
the Internet using standard protocols to perform functions or solve tasks. 

The Web service-oriented model applied to the system "KSNet" is organized as 
follows. The system acts as a service provider for knowledge customer services and, at 
the same time, as a service requestor for knowledge suppliers (OKBC-compliant 
knowledge representation systems). The proposed scenario is presented in Fig. 2. 
Its main peculiarity is that the service passes the request into the system, where it goes 
through all the stages of the request processing scenario. The service requestor sends 
a request to the KSNet service factory for a service creation. The KSNet service factory 
defines a service provider and returns a reference to it to the requestor. After that, the 
requestor sends a request for the service to the service provider, and the latter interacts 
with the system “KSNet” to obtain the requested service. When the system 
finds an answer to the request, the reply is passed to the service provider and then 
to the requestor. 
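
The message flow of this scenario can be sketched as follows. This is a minimal in-process illustration in Python, assuming plain method calls in place of SOAP messages; the class and method names (`ServiceFactory`, `create_service`, `serve`, `process`) are hypothetical and do not come from the system "KSNet".

```python
class KSNetSystem:
    """Stand-in for the system "KSNet": processes a request, returns a reply."""

    def process(self, request: str) -> str:
        return f"answer to: {request}"


class ServiceProvider:
    """Passes the request into the system and relays the reply back."""

    def __init__(self, system: KSNetSystem) -> None:
        self._system = system

    def serve(self, request: str) -> str:
        return self._system.process(request)


class ServiceFactory:
    """Defines a service provider and returns a reference to it."""

    def __init__(self, system: KSNetSystem) -> None:
        self._system = system

    def create_service(self) -> ServiceProvider:
        return ServiceProvider(self._system)


# Requestor side: ask the factory for a service, then call the provider.
factory = ServiceFactory(KSNetSystem())
provider = factory.create_service()
reply = provider.serve("configure mobile hospital")
```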

The key to Web Services is on-the-fly software creation through the use of loosely 
coupled, reusable software components [12]. For this purpose the described approach 
implements adaptive services. These services may modify themselves when solving a 
particular task. For example, within the KSNet-approach there is a service attached to 
an application that is responsible for configuration problem solving based on existing 
knowledge. Upon receiving a task, the application loads an appropriate AO and 
generates an executable module for solving it “on-the-fly”. 

Referring to the approach framework, a request defines a task statement (goal) 
which, in turn, defines what ontologies describe knowledge relevant to the request 
and what KSs contain information to generate the answer. Knowledge relevant to the 
request is described by an AO. Thereby, a triple <G, AO, KS>, where G is the goal, AO is the 
application ontology for the request processing, and KS is the set of available KSs containing the 
requested information, can be considered as an abstract structure. Depending on a 
particular request, this structure can be refilled with information extracted from the 
request. The “on-the-fly” compilation mechanism enables ILOG code generation 
according to the filled structure. It is based on the following aspects (Fig. 3): (1) a 
pre-processed request defines which ontologies of the ontology library are relevant to 
the request and which KSs are to be used; (2) C++ code is generated on the basis of 
the filled triple <G, AO, KS>; (3) the compilation is performed in the environment of 
a C++ project prepared in advance; (4) failed compilations/executions must not bring down 
the system; instead, an appropriate message is generated. 

Services preceding the “on-the-fly” compilation are supported by ILOG 
Configurator. The essence of the proposed “on-the-fly” compilation mechanism is 
writing the AO elements (classes, attributes, domains, constraints) directly to a C++ file. 
A C++ file is created from these elements, and the created source code is 
inserted into existing source code that serves as a template. The program is compiled 
and an executable DLL file is created. After that, the function in the DLL that solves the 
task is called. 
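
The generate, compile, and call cycle, together with the requirement that a failed build only produces a message, can be illustrated with a small analogue. The sketch below substitutes Python source generation and the built-in `compile`/`exec` for the C++ project and DLL described above; the template, the function `build_and_run`, and the constraint strings are hypothetical, and the constraint text is assumed to come from a trusted source.

```python
from string import Template
from typing import Dict, List, Tuple

# Source template prepared in advance; AO elements are written into it.
SOLVER_TEMPLATE = Template(
    "DOMAINS = $domains\n"
    "def solve():\n"
    "    from itertools import product\n"
    "    names = list(DOMAINS)\n"
    "    for values in product(*(DOMAINS[n] for n in names)):\n"
    "        env = dict(zip(names, values))\n"
    "        if $constraint:\n"
    "            return env\n"
    "    return None\n"
)


def build_and_run(domains: Dict[str, List[int]], constraint: str) -> Tuple[bool, object]:
    """Generate solver source, compile it, and call its entry point.

    Returns (True, solution) on success, or (False, message) if the
    generated module fails to compile or execute.
    """
    source = SOLVER_TEMPLATE.substitute(domains=repr(domains), constraint=constraint)
    module_namespace: dict = {}
    try:
        code = compile(source, "<generated>", "exec")
        exec(code, module_namespace)  # assumes trusted constraint text
        return True, module_namespace["solve"]()
    except Exception as error:  # a failed build must not stop the system
        return False, f"request could not be processed: {error}"


ok, result = build_and_run({"beds": [40, 80], "capacity": [50]},
                           "env['beds'] >= env['capacity']")
bad, message = build_and_run({"beds": [40]}, "env['beds'] >=")  # malformed constraint
```

Because only the filled-in parts of the template change between requests, the same generator can be reused when the parametric constituent or the AO changes.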



4 Case Study 

A coalition formation problem in the Binni region was chosen as the application domain 
for verification and validation of the approach. The aim of the Binni 
scenario used [13] is to provide a rich environment, focusing on new aspects of coalition 
problems and new technologies demonstrating the ability of distributed systems for 
intelligent support to supply services in an increasingly dynamic environment. The 
task considered in this paper is a mobile hospital configuration 
in the Binni region. 

Because of the space limit, a request with a template-based structure is 
considered. Descriptions of the problems relating to goal identification; the alignment of 
request, KS, and ontology vocabularies; and the steps of AO composition 
are beyond the scope of the paper. The template and the case study scenario were developed based 
on the results of parsing several requests concerning the mobile hospital configuration 
task. Below, a general request is considered. Request terms defined in the template 
are italicized. 

Define suppliers, transportation routes and schedules for building 
a mobile hospital of given capacity at given location by given time. 

The term given generalizes values for the assumed number of patients, desirable 
hospital sites, and deadlines of the hospital formation used in the parsed requests. This 
term corresponds to the input fields of the template. 

A service requestor represented by a Web-service client sends a request to the 
system via the standard SOAP-based interface. It has been implemented using PHP 
[14] and NuSOAP Web Services Toolkit [15]. This combination enables rapid 
development of applications such as Web services. 

Request terms corresponding to the structural request constituent are: suppliers, 
transportation routes, schedules, building, mobile hospital, capacity, location, time. A 
parametric request constituent consists of the values represented by the term “given”. 

Parts of ontologies corresponding to the described task were found in Internet 
ontology libraries [16 - 21] by an expert. These ontologies represent a hospital in 
different manners. First, the ontologies were imported from the source formats into 
the system notation by means of a tool developed for ontology library 
management. The tool supports ontology import from / export to external knowledge 
representation formats [22]. After that, they were included in the ontology library, so that 
they can be reused for the solution of similar problems. Next, ontology 
parts relevant to the request were combined into a single ontology. Principles 
underlying AO composition are described in [23]. The resulting AO is shown in Fig. 
4. In the figure, solid unidirectional arrows represent hierarchical “is-a” relationships, 
dotted unidirectional arrows represent hierarchical “part-of” relationships, and 
double-headed arrows show associative relationships. The ontology part corresponding to the AO 
included in the case study is represented by the shaded area. 

No ontologies corresponding to configuration tasks were found in known 
ontology libraries and servers. An ontology for the hospital configuration task (the 
class “hospital configuration” in Fig. 4) was elaborated by knowledge engineers and 
ontology engineers. The ontology is expanded in Fig. 5, which shows the “part-of” 
relationships between the classes. In the considered example, the 
method for staff definition is not taken into account, as the class “Staff” related to it is not 
included in the part of the case study under consideration. 




Fig. 4. “Mobile hospital” application ontology 

Fig. 5. Task ontology “Hospital configuration” 



After the request enters the system, a set of KSs containing information relevant to 
the request is identified. To solve the case study task, the system uses the AO to gather 
the required information from the distributed KSs that the knowledge map refers to. The 
knowledge map is an expandable knowledge storage holding references to KS 
locations. Since automatic knowledge seeking is a subject of future research, for 
the case study a list of KSs containing information for the request processing was 
prepared by an expert team. 

Once all the components ILOG needs for the request processing (see the framework 
in Fig. 1) are specified, the parametric constituent, the AO (Fig. 4), and the information 
contained in the set of knowledge sources related to this AO are processed by ILOG. 

As shown in Fig. 5, the “Hospital configuration” task has an explicit hierarchical 
structure that allows the task to be decomposed. In the case study, the 
“on-the-fly” mechanism is used for solving the task “Resource allocation” (choice 
of suppliers based on suppliers’ availability of commodities, suppliers’ locations, 
route availability, etc.). The application of the mechanism enables reuse of the 
developed module even if the parametric constituent of the request or the AO has 
changed. 

The solution for the subtask “Hospital allocation” is given as an example. The 
solution is illustrated by a map of the Binni region with cities and the chosen 
transportation routes, as shown in Fig. 6. The figure uses the following notation. 
Small dots are the cities of the region. The city indicated with a pin (Aida) is the 
city closest to the disaster (indicated with a cross); the mobile hospital is to be 
built there. The bigger dots are the cities where suppliers are situated and which have to be 
visited (Libar, Higgville, Ugwulu, Langford, Nedalla, Laki, Dado). Transportation 
routes are shown as lines. The lines with trucks denote routes of particular vehicle 
groups (indicated with appropriate callouts). Other lines are routes that are not used 
for transportation in the solution. Lines attached to the closed cities (indicated with 
thunderstorms next to them) are not available for transportation. 

Fig. 6. Example results of request processing 



5 Conclusion 

The paper describes an ontology-driven approach to the knowledge logistics problem and 
its applicability to the collaboration environment of a networked organization. The 
approach addresses an ontology-driven methodology for problem solving through 
knowledge source network configuration. It is based on object-oriented constraint 
network theory as a fundamental / representation ontology and on the technology of 
constraint satisfaction / propagation. Owing to the Web Intelligence and Web Services 
technologies incorporated in the approach, it is well suited for use in the Web 
environment as the e-business, and thereby networked organization, environment. 

The coalition operation scenario considered in the case study includes problems 
similar to those of application domains such as e-business, logistics, supply chain 
management, configuration management, e-government, etc., and thereby the scenario 
can be reapplied to different domains. 
Abstract. Intelligent planning helps to solve a great number of problems 
based on efficient algorithms and human knowledge about a specific 
application. However, knowledge acquisition is always a difficult and 
tedious process. This paper presents a knowledge acquisition and management 
system for an intelligent planning system. The planning system 
is designed to assist an operator of a power plant in difficult maneuvers, 
such as those required in the presence of faults. The architecture of the planner is briefly 
described, as well as the representation language. This language is adequate 
to represent process knowledge, but it is difficult for an expert 
operator to capture his/her knowledge in the correct format. This paper 
presents the design of a knowledge acquisition and management system 
and describes a case study where it is being utilized. 



1 Introduction 

Intelligent planning is an area of artificial intelligence (AI) that has been influenced 
by the growing trend toward application to real problems. Briefly, a 
typical planning problem is one that can be characterized by the representation 
of the following elements: 

— the state of the world, including initial and goal state, 

— the actions that can be executed in this world. 

If these representations can be made formally, then the planner is a program 
whose output is a sequence of actions that, when applied to the established 
initial state, produces the established goal state [10,4]. However, the acquisition 
and representation of all the knowledge required to solve a problem is always a 
tedious and complicated process. This is especially true in the application presented in this 
paper, where the planning system assists an operator of a power plant in difficult 
and rare maneuvers. 

This work forms part of a larger project dedicated to diagnosis and planning 
for a power plant. The diagnosis system detects and isolates a faulty component 
that makes the process behave abnormally [3], and the planning system 
assists the operator in minimizing the effects of the failure and coexisting with it in 
a safe way until maintenance can be carried out. The maneuvers are composed 
of several actions that the operator executes, e.g., closing valves, starting pumps, 
opening switches, etc. The state of the process is obtained utilizing sensors and 
other sources of information like personnel reports. 

The diagnosis system reports the detection of a faulty component. The planner 
receives this information and designs a plan that advises the operator 
on how to keep the process in the most useful and safe state. For example, if 
the diagnosis detects a failure in one of the temperature sensors, the operator 
may reduce the load and keep the unit generating at lower temperatures 
while the failure is fixed. Alternatively, if temperature readings are uncertain, 
some operators may shut down the turbine in order to avoid damage. 
Timely advice to the operator may preserve the availability index of the plant. 

Consequently, the planning system must be provided with all the knowledge 
that deals with all the possible detectable failures. This knowledge includes all 
the possible actions that can be recommended, considering the current state of 
the generation process. However, this knowledge belongs to the operators and 
plant managers, who are not familiar with the notation of the representation 
language. The knowledge acquisition and management system (KAMS) is 
a useful tool in two basic ways: 

1. it permits the acquisition of knowledge in a fill-in-the-blanks form, keeping the 
user away from the semantic restrictions of the representation language, and 

2. it maintains coherence and completeness of the knowledge base, and determines 
the required variables from the real time database. 

The knowledge base is a complete set of actions that the operator can execute, 
together with the preconditions and purposes of these actions. The actions are 
represented in a language inspired by the ACT formalism, developed at Stanford 
Research Institute (SRI) [9]. In this formalism, the set of ACTs forms the 
knowledge base from which the plan is defined. 

Similar efforts are being developed in the international community, especially 
in the ontology community. For example, the Protege system [6] is a knowledge 
acquisition tool that produces ontologies and knowledge bases that other 
programs can read. Its inference engine is the Jess system [7] (a Java implementation 
of the CLIPS [2] shell). Thus, the Protege system carries out for 
Jess the duty that KAMS performs for the dynamic planner system developed by this 
research group. Other proposals include WebODE [1] and OntoEdit [5]. 

This paper is organized as follows. The next section briefly describes the architecture 
of the planning system. Section 3 describes the representation 
language in which all the knowledge is captured. Section 4 describes the design 
and implementation of the knowledge acquisition and management system, 
and Section 5 describes the consistency checking procedures carried out by the 
KAMS. Section 6 presents an example of the use of the system in a case study 
and discusses some results. Finally, Section 7 concludes the paper and addresses 
future work in this area. 

Fig. 1. Architecture of the planning system. 



2 Architecture of the System 

Figure 1 shows the architecture of the plan generation agent. This architecture 
has two modules: the data to knowledge mapper and the searching algorithm 
module. The mapper is formed by the production system, the working memory 
and the rules. The production system consults the values of the variables and 
triggers the corresponding rules, which produce logical expressions stored in the 
working memory. This working memory represents the current state of the process. 
For example, when the variable VL65 has the value of logic one, the formula 
(main-switch state ON) can be inferred. Additionally, a set of rules can be 
included to store a theory of causality in the domain. For example, if the formula 
(main-switch state ON) is included in the working memory, then it can also 
be inferred that (400KV-bus state ON) and (main-transformer state ON). 
Thus, the knowledge base includes all the rules that may generate a complete 
theory that reflects the current state of the plant. This theory is stored in the 
working memory of the production system. Fig. 1 also shows a goal module 
that can be considered as part of the working memory. The goal formulas are a 
determinant part of the theory that is utilized by the searching algorithm in the 
formation of a plan. 
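
The behaviour of the mapper in this example can be sketched in a few lines. The following Python fragment is only an illustration, assuming a simple forward-chaining loop; the rule tables `DATA_RULES` and `CAUSAL_RULES` and the function `map_data_to_knowledge` are hypothetical names, and only the VL65 example is taken from the text.

```python
from typing import Dict, List, Set, Tuple

Formula = Tuple[str, str, str]  # (object, attribute, value)

# Mapper rules: raw variable value -> logical formula (example from the text).
DATA_RULES: List[Tuple[str, int, Formula]] = [
    ("VL65", 1, ("main-switch", "state", "ON")),
]

# Causal rules: premise formula -> formulas that can also be inferred.
CAUSAL_RULES: Dict[Formula, List[Formula]] = {
    ("main-switch", "state", "ON"): [
        ("400KV-bus", "state", "ON"),
        ("main-transformer", "state", "ON"),
    ],
}


def map_data_to_knowledge(variables: Dict[str, int]) -> Set[Formula]:
    """Fill the working memory from raw variables, then apply causal rules."""
    memory: Set[Formula] = {
        formula for name, value, formula in DATA_RULES
        if variables.get(name) == value
    }
    agenda = list(memory)
    while agenda:  # forward chaining until no new formula appears
        premise = agenda.pop()
        for conclusion in CAUSAL_RULES.get(premise, []):
            if conclusion not in memory:
                memory.add(conclusion)
                agenda.append(conclusion)
    return memory


working_memory = map_data_to_knowledge({"VL65": 1})
```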

The searching or planning algorithm uses the working memory with the state 
of the plant, the current goals and the knowledge base of the operations of the 
plant codified in the formalism described below. The searching algorithm uses 
all these elements in order to find the plan. In the first prototype of this planner, 
the planning algorithm executes an exhaustive search. The search is made 
over the conditions and purposes of all possible actions in the knowledge 
base. The plan is simply a list of the actions that have to be followed by the 
operator in response to a specific current state with a specific goal. A deeper 
description of this module can be found in [8]. 
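
One possible reading of such an exhaustive search is a breadth-first exploration that chains actions whose conditions hold in the current state. The sketch below, in Python, makes several simplifying assumptions that are not stated in the text: facts are plain strings, actions only add facts and never retract them, and the action `start-turbine` is hypothetical. The actual search procedure is the one described in [8].

```python
from collections import deque
from typing import FrozenSet, List, NamedTuple, Optional


class Action(NamedTuple):
    name: str
    conditions: FrozenSet[str]  # must hold before the action
    purposes: FrozenSet[str]    # hold after the action


def plan(state: FrozenSet[str], goal: FrozenSet[str],
         actions: List[Action]) -> Optional[List[str]]:
    """Breadth-first exhaustive search; returns action names or None."""
    queue = deque([(state, [])])
    visited = {state}
    while queue:
        current, steps = queue.popleft()
        if goal <= current:
            return steps
        for action in actions:
            if action.conditions <= current:
                successor = current | action.purposes
                if successor not in visited:
                    visited.add(successor)
                    queue.append((successor, steps + [action.name]))
    return None


actions = [
    Action("start-turbine", frozenset({"turbine stand-by"}),
           frozenset({"turbine rotating"})),
    Action("close-field-switch", frozenset({"turbine rotating"}),
           frozenset({"turbine generating"})),
]
steps = plan(frozenset({"turbine stand-by"}),
             frozenset({"turbine generating"}), actions)
```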

3 Representation Language 

The representation language utilized in this project is based on knowledge units 
(KUs). The KUs are inspired by the ACT formalism, developed at Stanford Research 
Institute (SRI) [9]. By definition, a KU describes a set of actions that 
can be executed in order to achieve an established purpose in certain conditions. 
The following example shows how a typical piece of expert knowledge can be 
represented in one KU. 

The conversion of a gas turbine from the rotating mode to the generating 
mode can be obtained by closing the field generator switch. 

Here, the purpose is to get the gas turbine into the generating mode. This can 
be achieved if the environment conditions include the gas turbine in the rotating 
mode. The action that can be executed is the closing of the field generator switch. 
In general, the actions are all those atomic operations that can be executed in 
the plant, e.g., closing a switch. The purpose is the condition that will be present 
in the process when the actions have been executed. The conditions refer to the 
state of the process that must be present before the actions can be executed. 

In this KU formalism, all the concepts mentioned above (actions, purposes 
and conditions) are represented utilizing the following elements: logic formulas, 
predicates and goal expressions. The logic formulas are utilized to reference 
all the objects in the plant, including their characteristics, properties, attributes 
and values. A logic formula can be expressed in two formats: {object, 
attribute, value} or {variable ID, value}. For example, (turbine state normal) 
or (valve1 1) are logic formulas. The predicates are always applied to the logic 
formulas for the representation of actions and goals. Thus, a goal expression is 
a predicate applied to a logic formula. For example, the representation of the 
action for opening a valve can be (achieve (GasValve state open)), and the expression 
to verify whether the valve is open can be (test (valve1 1)). This will be true 
if the valve is open and false otherwise. Here, achieve and test are the predicates 
applied to the logic formulas (GasValve state open) and (valve1 1) to 
indicate an action (achieve) or a verification (test). More complex expressions 
can be represented to verify certain conditions in the plant. For example, the 
expression: 

(test (and (gas-turbine state hot-standby) 
           (heat-recovery state hot-standby) 
           (steam-turbine state hot-standby))) 

can be utilized to test whether the main equipment of the plant is in the standby 
mode, ready to work in the normal mode. 

In summary, the KU formalism is utilized in this project to represent all 
the knowledge needed by the planning system. The KU formalism utilizes goal 
expressions to represent actions, goals and conditions. 
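
To make the role of the test predicate concrete, the following Python sketch evaluates a goal expression against a working memory of OAV formulas. It is an illustration only: the tuple encoding of expressions and the functions `holds` and `evaluate_test` are assumptions made for this example, not the planner's actual interpreter.

```python
from typing import Set, Tuple

Formula = Tuple[str, str, str]  # (object, attribute, value)


def holds(expr: tuple, memory: Set[Formula]) -> bool:
    """Evaluate a logic formula or an and/or/not combination of formulas."""
    head = expr[0]
    if head == "and":
        return all(holds(e, memory) for e in expr[1:])
    if head == "or":
        return any(holds(e, memory) for e in expr[1:])
    if head == "not":
        return not holds(expr[1], memory)
    return expr in memory  # plain (object attribute value) formula


def evaluate_test(expr: tuple, memory: Set[Formula]) -> bool:
    """The test predicate: verify a condition without changing the state."""
    return holds(expr, memory)


memory = {
    ("gas-turbine", "state", "hot-standby"),
    ("heat-recovery", "state", "hot-standby"),
    ("steam-turbine", "state", "hot-standby"),
}
standby = ("and",
           ("gas-turbine", "state", "hot-standby"),
           ("heat-recovery", "state", "hot-standby"),
           ("steam-turbine", "state", "hot-standby"))
ready = evaluate_test(standby, memory)
```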
Syntactically, a KU is formed of three elements: 

name: a unique ID for the KU, 

environment: a set of goal expressions that define the environment conditions 
that hold before and after a KU is executed, 

plot: a network of activities where the nodes represent the atomic activities that 
need to be executed to complete the KU goal. The arcs represent temporal 
relations between the different activities. 

The plot can be seen as the lowest level of abstraction of the activities carried 
out in the control of the plant. At the same time, a KU can be seen as the 
representation of activities with a higher level of abstraction. An example of a 
KU is shown in Fig. 2. 



(TG1G 
  (environment 
    (cue (achieve (turbine1 state generating))) 
    (preconditions (test (turbine1 state stand-by)))) 
  (plot 
    (tg1gi (type conditional) 
      (orderings (next tg1g1))) 
    (tg1g1 (type conditional) 
      (achieve (Swtg1 0)) 
      (orderings (next tg1gf))) 
    (tg1gf (type conditional)))) 

Fig. 2. An example of a KU. 

The elements of the plot section represent a network. The 
syntax of the nodes includes a name, a type, the orderings and other elements. 
The type can be conditional or parallel. In a parallel node, all the trajectories 
that include that node must be executed, while in a conditional node, just one 
of the paths can be executed. The orderings field indicates the temporal order 
in which the actions must be executed. In the example of Fig. 2, there is an arc 
from node tg1gi to node tg1g1. 
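
The structure of a KU (name, environment, plot network) can also be pictured as a small data structure. The Python sketch below is an assumption-laden illustration: the classes `PlotNode` and `KnowledgeUnit`, the method `walk`, and the placeholder node names `n_initial`, `n_action` and `n_final` are hypothetical and merely stand for the three nodes of the example in Fig. 2.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class PlotNode:
    name: str
    node_type: str                      # "conditional" or "parallel"
    action: Optional[str] = None        # goal expression, if any
    next_nodes: List[str] = field(default_factory=list)  # orderings


@dataclass
class KnowledgeUnit:
    name: str
    cue: str                            # purpose of the KU
    preconditions: str
    plot: Dict[str, PlotNode]

    def walk(self, start: str) -> List[str]:
        """Follow the first ordering of each node from `start` to the end."""
        path: List[str] = []
        current: Optional[str] = start
        while current is not None and current not in path:
            path.append(current)
            successors = self.plot[current].next_nodes
            current = successors[0] if successors else None
        return path


# Placeholder encoding of a three-node plot: initial -> action -> final.
ku = KnowledgeUnit(
    name="example-ku",
    cue="(achieve (turbine state generating))",
    preconditions="(test (turbine state stand-by))",
    plot={
        "n_initial": PlotNode("n_initial", "conditional", next_nodes=["n_action"]),
        "n_action": PlotNode("n_action", "conditional",
                             action="(achieve (switch 0))", next_nodes=["n_final"]),
        "n_final": PlotNode("n_final", "conditional"),
    },
)
path = ku.walk("n_initial")
```

Following only the first ordering is enough for a linear plot such as this one; a plot with parallel nodes would require visiting every successor.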

As can be seen in Fig. 2, the syntax of KUs is complicated and difficult to 
use for an expert in turbines. The next section presents a proposal to overcome 
this problem. 

4 Knowledge Acquisition and Management 

The KAMS is a system that acquires and manages knowledge. It has been designed 
to perform two main functions: (i) acquisition of knowledge, and (ii) 
management and validation of it. Accordingly, the following functional modules 
form the KAMS: 

Capture panel: special windows for the capture of all the required information. 

Mapper rules: display of the captured information that will be used in the 
data to knowledge mapper. 

Real time data base: display of the identifiers of all the captured variables 
that participate in the planning process. 

Consistency checking: revision of the completeness and consistency of all the 
captured knowledge. 

The following sections describe these modules separately. 

4.1 Capture Panel 

The capture panel shows three sections: (i) the capture of the environment conditions, 
(ii) the capture of nodes of the plot section, and (iii) the construction of 
the rules required in the mapper. 

Figure 3 shows (in Spanish) the panel that captures the environment conditions 
at the left, and the plot nodes at the right. The left panel contains 
two sections: first, the capture of the post-conditions and, second, that of the 
preconditions. In both cases, the conditions consist of logical formulas in the 
format {object, attribute, value}. The post-conditions section allows the capture of one 
condition or the conjunction of several conditions. Two buttons allow adding or removing 
conditions from this section. 

Fig. 3. Capture of the environment section and the plot nodes of the KUs. 



The preconditions section is similar to the post-conditions section, with the only 
difference that several conditions can be combined with logical functions 
other than conjunction. In Fig. 3, a KU with the precondition 

(temp2 estado falla) and (f_temp2 estado altaDiferencial) 

and the purpose 

(tfalla estado altaDiferencial) 

is being captured. The right panel captures all the fields that make up the 
plot nodes, namely the action, the time required to execute the action, the 
type of node, the message that will be issued to the operator and the links to 
other nodes. The type of node can be parallel or conditional, and normal or 
acceptation. The action executed can be expressed as a logical combination of 
logical expressions. The logical operators that can be included are and, or, not, 
and arithmetic comparators such as <, =, >, etc. 

A text window is included to capture the message that will be issued to the 
operator when the action is to be carried out. At the right of Fig. 3, an 
example of a normal, conditional node is being captured. The node requires 
that the variable IHMA3-CARGA reach the state 1, which will be 
obtained once the operator follows the message included. In this example, 
he/she must change the control mode through the activation of a control switch 
in the console. 

4.2 Mapper Rules 

The environment conditions are logical formulas expressed in the format {object, 
attribute, value}, or OAV format. They express the preconditions that need to 
be present in order to execute a KU, and the conditions that will be present 
after the KU is executed. These expressions cannot be read directly but need 
to be derived from the real variables. This is the mapper function. However, 
the user must provide the rules that map data to knowledge. The KAMS 
acquires all the environment conditions and the real variables of the real time 
data base, so it matches both entities to define the rules. Every environment 
condition must be included in the conclusion side of a rule whose premise is 
based on the variables of the data base. KAMS generates a table with all the 
premise-conclusion pairs that form the mapper rules. 

For example, the value of the condition (temp2 estado falla) must be inferred 
from some real variables. This can be done with the following rule, also 
captured in KAMS: 

IF f_ttxd2 = 2, THEN temp2 estado falla 

The KAMS lists all the logical formulas captured in the environment 
conditions and expects the user to supply the left hand side of the rule that 
will produce the corresponding condition. 

4.3 Real Time Data Base 

When defining the mapper rules explained above, a list with all the variables is 
obtained. This list is compared with the real time data base, since it is very common to 
reference the same variable with different identifiers. This module holds all the variables 
that will be validated in the consistency checking procedure explained below. 

5 Consistency Checking 

The consistency checking process verifies the completeness and correspondence 
of all the elements of the knowledge base. They are: (i) logical formulas in the 
OAV format, used in the representation of knowledge, (ii) logical formulas in 
the {id, value} format, used in the representation of raw information, (iii) rules 
for the conversion between raw information and knowledge, (iv) rules for the 
representation of a causal theory of the domain, and (v) the real time data base. 
The verification process proceeds as follows. 

1. The environment conditions and the preconditions are written in the OAV 
format. This is the main knowledge acquisition activity. 

2. KAMS verifies whether all the OAVs are included in the right hand side of some 
rule. 

3. Once all the logical formulas are found in some rule, the left hand sides 
of the rules need to be verified. This verification is made between 
these left hand sides and the real time data base. It is a common mistake to 
spell the identifier of a variable differently in its different appearances. 

In summary, completeness means verifying that all logical formulas are 
included either in the rules or in the data base. Correspondence refers to the 
equivalence of elements in the different parts of the knowledge base. The 
example followed in this paper would be checked as follows: the user captures 
the precondition (temp2 estado falla). KAMS then searches for this formula and 
finds the rule (IF f_ttxd2 = 2, THEN temp2 estado falla). If a variable with 
id = f_ttxd2 exists in the real time data base, then the condition can be 
satisfied. 
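
The two checks described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Rule` structure, the function name `check_consistency` and the message texts are hypothetical, since the internal format used by KAMS is not given in the text. The identifier f_ttxd2 follows the spelling that appears in Fig. 4.

```python
from typing import NamedTuple

class Rule(NamedTuple):
    # Hypothetical structure for a mapper rule "IF variable = value THEN condition".
    variable: str      # identifier on the left hand side, e.g. "f_ttxd2"
    value: int         # value tested on the left hand side
    condition: tuple   # OAV triple on the right hand side

def check_consistency(conditions, rules, database_ids):
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    for oav in conditions:
        producers = [r for r in rules if r.condition == oav]
        if not producers:
            # Completeness: every OAV must appear on the right side of a rule.
            problems.append(f"no rule produces {oav}")
        for r in producers:
            if r.variable not in database_ids:
                # Correspondence: the rule variable must exist in the data base.
                problems.append(f"unknown variable {r.variable}")
    return problems

conditions = [("temp2", "estado", "falla")]
rules = [Rule("f_ttxd2", 2, ("temp2", "estado", "falla"))]
print(check_consistency(conditions, rules, {"f_ttxd2"}))  # no problems found
print(check_consistency(conditions, rules, {"f_txd2"}))   # misspelled identifier
```

The second call shows the kind of spelling mismatch between rule and data base that step 3 is meant to catch.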

Once all the elements of the knowledge base are complete and validated, 
the planner is ready to be executed in a real application. 



6 Use of KAMS 

A first prototype of the system was developed and tested on a simple case: 
the detection of a failure in one temperature sensor of a gas turbine in 
a power plant. A brief description follows. 

The planner designed at the HE is coupled with an intelligent diagnosis 
system, also being developed in this laboratory. When the diagnosis system 
finds an abnormal behavior in the monitored variables, for example in 
temperature sensor 2, it writes a code in the f_ttxd2 variable of the real time 
data base. The planner contains a rule stating that if the f_ttxd2 variable has 
the value 2, then a condition must be issued establishing that temperature 
sensor 2 is in a state of failure. Once the mapper declares this failure, the 
planner starts building a plan based on the environment conditions of the KUs 
in the knowledge base. Figure 3 left shows the capture of the first KU in a 
plan that deals with this problem. The main action in this plan is to decrement 
the load in order to decrease the magnitude of the temperature problem. This 
is obtained with the plot section of the KU, such as the one shown in Fig. 3 
right. 

(ciclopl 
  (environment 
    (cue (achieve (temp2 estado correcto))) 
    (preconditions (achieve (operador estado informadoFallaAD)))) 
  (plot 
    (esperal (type parallel) (normal) 
      (time-window inf inf inf inf eps 5) 
      (orderings (next ciclopl2) (next cicloplS))) 
    (ciclopl2 (type conditional) (normal) 
      (test (IHM_43_TEMP 1)) 
      (time-window inf inf inf inf eps 600) 
      (orderings (next ciclopl4))) 
    (cicloplS (type conditional) (normal) 
      (test (IHM_43_CARGA 1)) 
      (time-window inf inf inf inf eps 600) 
      (orderings (next nulo))) 
    (ciclopl4 (type conditional) (normal) 
      (achieve (IHM_43_CARGA 1)) 
      (time-window inf inf inf inf eps 60) 
      (orderings (next nulo))) 
    (nulo (type conditional) (normal) 
      (time-window inf inf inf inf eps 1) 
      (orderings (next cicloplS))) 
    (cicloplS (type parallel) (normal) 
      (achieve (* operador 1)) 
      (orderings (next ciclopl2) (next checafin))) 
    (checafin (type conditional) (normal) 
      (test (f_ttxd2 0)) 
      (orderings (next cicloplf))) 
    (cicloplf (type parallel) (aceptation) 
      (achieve (* falla 0))))) 

Fig. 4. A KU resulting from the use of the system. 

Figure 4 shows an example of the use of KAMS in the codification of a 
complex KU. 

This is an ASCII file that contains all the captured information in the format 
required by the planner system. First, the identifier of the KU is included: 
ciclopl. Second, the following three lines contain the environment conditions. 
Third, the nodes of the plot section are produced in a format that can be used 
directly by the execution monitor [8]. Notice how difficult it is for a power 
plant operator to keep track of all the parentheses that the KU syntax requires. 
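
To illustrate the kind of error that hand-written KUs are prone to, the following Python sketch checks whether the parentheses of a KU text are balanced and reports the line where the problem lies. It is not part of KAMS as described in the paper; the function name `check_parentheses` and the message texts are assumptions made for this example.

```python
def check_parentheses(text):
    """Return (ok, message) for a KU-style text with nested parentheses."""
    open_lines = []  # line numbers of the "(" characters not yet closed
    for line_no, line in enumerate(text.splitlines(), start=1):
        for ch in line:
            if ch == "(":
                open_lines.append(line_no)
            elif ch == ")":
                if not open_lines:
                    return False, f"line {line_no}: unexpected ')'"
                open_lines.pop()
    if open_lines:
        return False, f"line {open_lines[-1]}: '(' is never closed"
    return True, "balanced"

# One node taken from the plot section of Fig. 4.
ku_fragment = """(checafin (type conditional) (normal)
  (test (f_ttxd2 0))
  (orderings (next cicloplf)))"""

print(check_parentheses(ku_fragment))        # well-formed node
print(check_parentheses(ku_fragment[:-1]))   # last parenthesis forgotten
```

A tool that generates the file, as KAMS does, removes this class of error altogether instead of detecting it afterwards.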

The planner has been used in the capture and validation of knowledge for 
the diagnosis and corrective plans for gas turbines in power plants. 

7 Conclusions and Future Work 

This paper has presented a knowledge acquisition and management system 
designed to support the capture of the experience of a power plant operator. 
The knowledge was initially represented using the KU formalism. However, the 
experts found serious difficulties in using this language correctly. A single 
unclosed parenthesis caused frustration and loss of time. Also, the knowledge 
engineers discovered difficulties in the logic of the processes. 

KAMS has been used to capture the knowledge required by the diagnosis system 
for gas turbines. The diagnosis system finds different failures and the 
planning system issues the recommendations that can take the power unit back 
to its normal state or minimize the effects of the failures. Work is under way 
to apply the system to a more complex process of the plant. 

Abstract. Problems from various domains can be modeled as dynamic 
constraint satisfaction problems, where the constraints, the variables or 
the variable domains change over time. The aim, when solving this kind 
of problem, is to decrease the number of variables whose assignment 
changes between consecutive problems, a concept known as 
distance or stability. This stability problem has previously been studied, 
but only for variations in the constraints of a given problem. This 
paper describes a wider analysis of the stability problem, when modifying 
variables, domains, constraints and combinations of these elements 
for the resource allocation problem, modeled as a DCSP. Experiments 
and results are presented related to efficiency, distance and a new parameter 
called global stability for several techniques such as solution reuse, 
reasoning reuse and a combination of both. Additionally, results show 
that the distance behavior is linear with respect to the variations. 



1 Introduction 

A Constraint Satisfaction Problem (CSP) is a problem composed of a finite set 
of variables, each of them associated with a finite domain, and a set of 
constraints limiting the values that the variables can simultaneously take. 
Formally [1], a CSP is a triple $Z = (V, D, C)$, where 
$V = \{v_1, \ldots, v_n\}$ is a set of variables; 
$D = \{D_{v_1}, \ldots, D_{v_n}\}$ is a set of variable domains, one for each 
variable; and $C$ is a set of constraints, each being a pair $c = (Y, R)$, 
where $Y = \{v_{i_1}, \ldots, v_{i_k}\}$ is the set of variables intervening in 
the constraint and $R$ is a relation over the domains of those variables, that 
is, $R \subseteq D_{v_{i_1}} \times \cdots \times D_{v_{i_k}}$. A solution to a 
CSP is an assignment of a value from its domain to each variable, in such a way 
that all constraints are satisfied at the same time. 

A Dynamic Constraint Satisfaction Problem (DCSP) [2] is a series of static 
CSPs that change over time due to the evolution of their components (variables, 
domains and constraints). This evolution may be caused by changes in the 
environment (a change in the set of tasks to be executed and/or in their 
execution conditions, as in the assignment problem), by changes made by the 
user (in an interactive problem), or by variations in a process (the DCSP 
represents only part of a bigger problem, and a different entity solving the 
global problem changes the way in which the solving process can be performed; 
for instance, a distributed multiagent problem). In their original work, 
Dechter and Dechter [2] consider only modifications of the constraints of 
CSPs. There are six possible alterations between CSPs in a DCSP: increase or 
decrease of variables, increase or decrease of values in a domain, and increase 
or decrease of constraints. Jonsson and Frank [1] present a formal definition 
of all these possible alterations: 

Let $Z = (V, D, C)$ be a CSP. Any problem of the form $Z' = (V', D', C')$ 
such that $V' \supseteq V$ (i.e. there are more variables), 
$D'_v \subseteq D_v$ for each $v \in V$ (i.e. there are fewer values in the 
domain) and $C' \supseteq C$ (i.e. there are more constraints between 
variables) is a constraint (that is, a restriction) in $Z$. A problem of the 
form $Z' = (V', D', C')$ such that $V' \subseteq V$ (i.e. there are fewer 
variables), $D'_v \supseteq D_v$ for each $v \in V'$ (i.e. there are more 
values in the domain) and $C' \subseteq C$ (i.e. there are fewer constraints 
between variables) is a relaxation in $Z$. A DCSP is a sequence of CSPs 
$C_0, C_1, \ldots$, such that each problem $C_i$ is either a constraint or a 
relaxation of $C_{i-1}$. 
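
The definition above can be turned into two small predicates. The following Python sketch is an illustration only: the `CSP` structure and the function names are assumptions made here, and constraints are represented as plain hashable tuples so that set inclusion can be tested directly.

```python
from typing import NamedTuple

class CSP(NamedTuple):
    variables: frozenset    # V
    domains: dict           # D: variable -> frozenset of values
    constraints: frozenset  # C: hashable constraint descriptions

def is_restriction(new, old):
    """new keeps all variables and constraints of old and enlarges no domain."""
    return (new.variables >= old.variables
            and new.constraints >= old.constraints
            and all(new.domains[v] <= old.domains[v] for v in old.variables))

def is_relaxation(new, old):
    """new adds no variable or constraint and shrinks no domain."""
    return (new.variables <= old.variables
            and new.constraints <= old.constraints
            and all(new.domains[v] >= old.domains[v] for v in new.variables))

old = CSP(frozenset({"v1", "v2"}),
          {"v1": frozenset({1, 2, 3}), "v2": frozenset({1, 2, 3})},
          frozenset({("v1", "!=", "v2")}))
tighter = CSP(frozenset({"v1", "v2", "v3"}),
              {"v1": frozenset({1, 2}), "v2": frozenset({1, 2, 3}),
               "v3": frozenset({1, 2})},
              frozenset({("v1", "!=", "v2"), ("v2", "<", "v3")}))
looser = CSP(frozenset({"v1"}), {"v1": frozenset({1, 2, 3, 4})}, frozenset())

print(is_restriction(tighter, old), is_relaxation(tighter, old))
print(is_restriction(looser, old), is_relaxation(looser, old))
```

Note that, with these non-strict inclusions, an unchanged problem counts as both a restriction and a relaxation of itself.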

Solving a DCSP consists in finding a solution to every one of the CSPs in 
the sequence. One way of doing this is to solve each CSP in turn from scratch, 
an approach with two main drawbacks: inefficiency and lack of stability. 
Regarding efficiency, in many cases some of the work carried out for the 
previous CSP can be reused to solve the new CSP, which matters in real time 
applications, where time is limited. With respect to stability, if the solution 
to the previous CSP represents an assignment currently under execution, any new 
solution should minimize the effort necessary to modify that current 
assignment. For interactive or decentralized problems, it is better to make as 
few modifications as possible to the current solution. The lack of stability in 
the solutions (also known as distance between solutions) is an important 
problem because of the cost, in both money and effort, of modifying the current 
assignment to meet the new requirements; especially when, as is typical, there 
is no model of future events, that is, knowledge about when and what 
modifications will appear is limited, which makes it difficult to develop 
robust solutions. 

To tackle these problems, there are two approaches that take advantage 
of features of a CSP to solve the next one [3]. The first, called solution 
reuse, uses a previous solution (if there is one) to determine the new solution. 
Verfaillie and Schiex [4] developed an algorithm called LocalChanges, inspired 
by the techniques of Backjumping (if the current variable violates a constraint, 
the assignment of the other conflicting variable is revised, not necessarily 
the previous one) and iterative repair [5], which consists in taking a previous 
assignment and repairing it with a sequence of local modifications. The second 
approach, reasoning reuse, keeps an approximate description of the boundary 
of the solution space, and justifications of that boundary in terms of a set of 
constraints. Schiex and Verfaillie [6] developed a method called NogoodBuilder, 
which generates a set of nogoods from the constraints in the CSP. Nogoods are 
partial instantiations of variables that cannot be extended to a solution for 
the complete set of variables. These nogoods can be added as constraints to the 
next generated CSP. 

These models have been previously examined for variations in the constraints 
of a CSP, but a more detailed analysis has not been discussed in the literature. 
The main interest of this paper is to analyze the performance of these methods, 
in relation to efficiency and stability, when modifying variables, domains, and 
constraints, and additionally to propose and analyze new ways of integrating the 
two approaches in order to improve performance. 
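
As a rough illustration of how stored nogoods can act as extra constraints in a later CSP, consider the following Python sketch. It does not reproduce NogoodBuilder [6]; it only shows the filtering step, and the representation of a nogood as a dictionary of variable and value pairs, as well as both function names, are assumptions made for this example.

```python
def violates_nogood(assignment, nogoods):
    """True if the (partial) assignment contains every pair of some nogood."""
    return any(all(assignment.get(var) == val for var, val in nogood.items())
               for nogood in nogoods)

def allowed_values(var, domain, assignment, nogoods):
    """Values of var that do not complete any stored nogood."""
    return [val for val in domain
            if not violates_nogood({**assignment, var: val}, nogoods)]

# A nogood recorded while solving a previous CSP: v1 and v2 cannot both use R1.
nogoods = [{"v1": "R1", "v2": "R1"}]

print(violates_nogood({"v1": "R1", "v2": "R1", "v3": "R2"}, nogoods))
print(allowed_values("v2", ["R1", "R2", "R3"], {"v1": "R1"}, nogoods))
```

Values removed in this way are never tried again, which is where the saving in consistency checks comes from.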

This article is organized as follows. Section 2 establishes the methodology 
followed to analyze the various approaches for solving DCSPs. Section 3 presents 
the experimental setup and the analysis of results. Finally, in Section 4 our 
conclusions are included. 



2 Methodology 

The first step in the methodology was modeling the Resource Allocation Problem 
(RAP) as a DCSP. Next, a problem generator was implemented. Based on 
the original algorithms for solution reuse and reasoning reuse, two new hybrid 
algorithms were developed that integrate both approaches. Then, the stability 
analysis was performed on the four algorithms considering the sensitivity and 
the distance parameters. 

2.1 Resource Allocation Problem 

Assigning resources to tasks is an important problem with a wide variety of 
applications (satellite telecommunications [7], dynamic routing in computer 
networks [8]). The Resource Allocation Problem [9] consists in allotting 
resources to a set of tasks, scheduled for certain time intervals, in such a 
way that a resource is not assigned to two different tasks at the same time. 
This problem is NP-complete [10], meaning that no polynomial-time algorithm is 
known for it, so solving problems of realistic size exactly can be infeasible. 
Formally: Variables: $\{v_1, \ldots, v_n\}$, where $v_i$ represents task $i$ of 
the $n$ tasks; Domains: $D_{v_i} = \{R_1, \ldots, R_m\}$, where $R_j$ 
represents resource $j$ out of a total of $m$ resources that can be assigned to 
task $v_i$; and Constraints: $v_i \neq v_j$ if $i \neq j$, which establishes 
that tasks $v_i$ and $v_j$ cannot be at the same resource at the same time. 
This model was extended to include two additional constraint types: $>$ and 
$<$. So, the RAP modeled as a DCSP consists of a series of CSPs defined by the 
previous model, each of which may undergo alterations in variables, domains or 
constraints. A generator of DCSPs was developed to create random instances of 
binary CSPs based on the features of a RAP. It is possible to increase or 
decrease the size of the set of variables, the domains, the number of 
constraints, or any combination of these three. The smallest size of a 
variable domain is 2 for experimental purposes. 
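
A small instance of this model can be written down and solved as follows. This is a minimal sketch under stated assumptions: resources are encoded as integers so that the $<$ and $>$ constraints have a meaning, constraints are triples of the form (variable, operator, variable), and the solver is plain chronological backtracking used as a stand-in. It is not one of the algorithms analyzed in the paper, nor the authors' generator.

```python
import operator

OPS = {"!=": operator.ne, "<": operator.lt, ">": operator.gt}

def consistent(var, value, assignment, constraints):
    """Check value for var against constraints whose other variable is assigned."""
    for a, op, b in constraints:
        if a == var and b in assignment and not OPS[op](value, assignment[b]):
            return False
        if b == var and a in assignment and not OPS[op](assignment[a], value):
            return False
    return True

def solve(variables, domains, constraints, assignment=None):
    """Chronological backtracking; returns one solution or None."""
    assignment = {} if assignment is None else assignment
    if len(assignment) == len(variables):
        return dict(assignment)
    var = next(v for v in variables if v not in assignment)
    for value in domains[var]:
        if consistent(var, value, assignment, constraints):
            assignment[var] = value
            result = solve(variables, domains, constraints, assignment)
            if result is not None:
                return result
            del assignment[var]  # undo and try the next value
    return None

tasks = ["v1", "v2", "v3"]
domains = {v: [1, 2, 3] for v in tasks}  # resources R1..R3 encoded as 1..3
constraints = [("v1", "!=", "v2"), ("v2", "<", "v3"), ("v1", ">", "v3")]
print(solve(tasks, domains, constraints))
```

Each call to `consistent` corresponds to the kind of operation that the consistency-check counts in the experiments are meant to measure, although the exact counting convention of the paper is not specified here.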



2.2 Analyzed Algorithms 

In addition to the LocalChanges [4] and NogoodBuilder [6] algorithms, two 
additional algorithms were implemented. The first of them, Lopt, is based on 
LocalChanges and, together with a method for generating nogoods implicit in the 
technique of iterative repair, performs reasoning reuse as well. Its process 
for generating nogoods prunes the search space more than the NogoodBuilder 
algorithm does. The second algorithm, called Nopt, complements the 
NogoodBuilder algorithm with solution reuse, taking advantage of its recursive 
nature. These two algorithms use solution reuse when there are relaxations in 
the DCSP, and reasoning reuse in the case of constraints, which avoids solving 
any CSP in the DCSP from scratch. 

2.3 Sensitivity and Distance Analysis 

The sensitivity of the algorithms was measured according to the following 
parameters: 

(a) Stability (Distance): Distance between two successive solutions, that is, 
the number of variables with different assignments between the solutions of 
two contiguous CSPs. 

(b) Null Generation: Number of times the distance between the solutions of two 
contiguous CSPs is zero. Currently, the concept of stability considers only 
those variables whose assignment varies between the solutions of two 
contiguous CSPs, regardless of how many times such variation occurs during 
the complete solution of the DCSP. For this reason, a new metric is proposed 
that also measures this frequency during the solving process of the DCSP. 

(c) Consistency Checks: Number of times an algorithm determines whether a 
potential value for a variable violates a constraint with any other variable. 
This is a common parameter for measuring the efficiency of algorithms. 
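
The first two parameters can be computed directly from a sequence of solutions, as the following Python sketch shows. It assumes that each solution is a dictionary from variables to values and that only variables present in both solutions are compared; how unsolved CSPs and newly added variables are counted in the paper is not specified, so these choices are assumptions.

```python
def distance(prev_solution, new_solution):
    """Number of variables present in both solutions whose value differs."""
    shared = prev_solution.keys() & new_solution.keys()
    return sum(1 for v in shared if prev_solution[v] != new_solution[v])

def null_generation(solutions):
    """Number of contiguous solution pairs whose distance is zero."""
    return sum(1 for prev, new in zip(solutions, solutions[1:])
               if distance(prev, new) == 0)

solutions = [
    {"v1": 1, "v2": 2},
    {"v1": 1, "v2": 2, "v3": 3},  # a variable was added, nothing reassigned
    {"v1": 2, "v2": 2, "v3": 1},  # v1 and v3 were reassigned
]

print([distance(a, b) for a, b in zip(solutions, solutions[1:])])
print(null_generation(solutions))
```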

To gain a better perspective on the performance of the algorithms, and to 
specialize them according to the types of variations that a DCSP can have, 
their sensitivity to changes was evaluated for the following modifications in 
the DCSP: number of variables (V), domain size (D), number of constraints (C), 
number of variables and domain size (V, D), domain size and number of 
constraints (D, C), number of variables and number of constraints (V, C), and 
number of variables, domain size, and number of constraints (V, D, C). 
Considering that in practice the domain size of variables changes relatively 
slowly, it was empirically established that the maximum number of changes 
between two CSPs is three (for any combination). A set of 30 DCSPs was 
generated to test each modification. Each DCSP is formed by 20 binary CSPs. The 
initial CSP has 40 variables, a domain size of 5, and 14 constraints. With 
this, the expected average number of changes is 2, and the number of variables 
is expected to increase in 10 out of the 20 CSPs. There is a relationship 
between the domain size and the number of constraints selected for the test 
problems. The idea was to obtain a 50% chance of solving the DCSP, that is, 
that at least 10 of the CSPs in the DCSP were solvable. After some 
experimentation, it was observed that solving a CSP took around 850 CPU 
milliseconds. This time was used as the maximum time limit for solving $CSP_i$ 
before proceeding to the next $CSP_{i+1}$. 

The aim of the distance analysis is to determine the way in which the 
distance between two CSPs varies. Only the case of constraints between CSPs was 
taken into account since, as mentioned in a previous section, when relaxing, 
the solution to the previous CSP is taken, and then the distance between 
solutions is zero. The distance was evaluated for the following modifications 
in the DCSP: an increase in the number of variables by 2, 4, 6, 8, 10, 12, 14, 
16, 18 and 20 variables; a modification of the domain of 2, 4, 6, 8, 10, 12, 
14, 16, 18 and 20 variables, where the number of values removed from each 
variable domain was randomly generated; and an increase in the number of 
constraints by 1, 2, 3, 4, 5, 6 and 7 constraints. These modifications were 
selected with the intention of having a total variation of 50% with respect to 
the initial parameters. For each of the proposed modifications, 300 DCSPs, 
modeled as a RAP, were evaluated, handling constraints of the types $>$, $<$ 
and $\neq$. Each DCSP was formed by two CSPs. The initial CSP has 40 variables, 
a domain size of 5, and 14 constraints. The reasons for selecting these 
parameters are explained later. The effect on the distance was analyzed by 
comparing the hybrid algorithm Lopt (with solution and reasoning reuse) against 
the LocalChanges algorithm (with solution reuse only). For these experiments, 
there was no time limit for solving a CSP. 

3 Experimental Results and Discussion 

The sensitivity analysis explores the behavior of the algorithms with respect 
to efficiency, stability (or distance), and null generation when varying the 
variables, the domains, the constraints or any combination of them. The second 
set of experiments is oriented to the distance analysis, whose purpose is to 
compute the number of assignment changes between the solutions of two CSPs. 



3.1 Sensitivity Analysis 

The sensitivity analysis was carried out over a set of 30 randomly generated 
DCSPs. In the case of the distance, the value reported for each DCSP is an 
average over the cases in which the distance was not zero. 

Efficiency. Figure 1 shows the consistency checks performed by the tested 
algorithms for each of the possible modifications. In general, NogoodBuilder 
and Nopt perform more consistency checks than LocalChanges and Lopt. This is 
due to the process of consistency elimination in each algorithm: Backjumping 
for NogoodBuilder and Nopt, and iterative repair for LocalChanges and Lopt. 
Lopt is the algorithm with the smallest number of consistency checks, as shown 
in Figure 1. This reduction in the number of consistency checks is due to 
solution reuse and reasoning reuse. The difference between NogoodBuilder and 
Nopt represents the work that Nopt avoids by using solution reuse, which is 
around 50% fewer consistency checks than those performed by NogoodBuilder. 
Lopt saves around 30% of the work done by LocalChanges due to the use of 
solution reuse. The biggest reduction is produced when reasoning reuse is used, 
since this technique prunes the complete tree when there are relaxations 
between the CSPs in the DCSP. 

Fig. 1. Results for efficiency in the sensitivity analysis. 

Distance. Figure 2 shows the average distance obtained by the tested 
algorithms for each of the modifications in the DCSP. It can be observed that 
LocalChanges and Lopt generate a larger distance between contiguous solutions. 
The search process in each algorithm seems to cause this behavior: Backjumping 
for NogoodBuilder and iterative repair for Lopt and LocalChanges. There is a 
small difference between LocalChanges and Lopt, basically produced by the use 
of reasoning reuse in Lopt, since using nogoods implies eliminating some values 
from the domains and forcing other values to be assigned to the variables. 
This increases the distance between solutions. In contrast, the solution reuse 
in Nopt does not affect its performance in comparison to NogoodBuilder, since 
this reuse precisely avoids distance between solutions. 

Null Generation. Figure 3 presents the number of CSPs whose solution did 
not change with respect to the solution of the previous CSP. There is a slight 
difference between LocalChanges and Lopt, caused by the use of reasoning reuse 
in Lopt. Let us assume that $CSP_i$ does not have a solution, but a partial 
instantiation of variables is generated right before failing to find a 
consistent assignment. For LocalChanges in $CSP_{i+1}$, the new modification 
may allow the algorithm to assign new values to variables before the 
inconsistency is reached. When the two partial instantiations are compared, 
they are different, and the null generation decreases. However, Lopt generates 
nogoods that make it possible to determine from the beginning that $CSP_{i+1}$ 
does not have a solution, avoiding any attempt to modify variables and 
preserving the null generation. On average, Lopt shows an increase of 17% in 
null generation with respect to LocalChanges. Nopt shows an increase of around 
14% with respect to NogoodBuilder due to solution reuse. 

The sensitivity analysis has shown that a hybrid mechanism combining solution 
and reasoning reuse produces better performance than a single method. Given 
that solution reuse eliminates consistency checks and produces zero distance 
between solutions, increasing the null generation, it is an approach that any 
algorithm for solving DCSPs should use. Looking at the case of constraints, a 
combination of the use of nogoods and a systematic approach to iterative repair 
would be a good option. A proposal for achieving this is to adjust iterative 
repair to start looking for solutions close to that established by the nogoods, 
and from there expand its search scope. The results obtained are independent of 
the DCSP size (20 CSPs in this case), since measurements are computed over 
contiguous CSPs, regardless of the number of CSPs in the DCSP. Let us consider 
a DCSP whose initial CSP has 100 variables. The behavior of the algorithms 
starting with this CSP is similar to their behavior when they face the same 
problem later in the process, after having started with a CSP of, say, 40 
variables. We can also conclude that the results of the analysis presented in 
this research remain consistent for CSPs with a higher number of variables. A 
comparable analysis can be carried out for the domain size and the number of 
constraints. 

Fig. 2. Results for distance in the sensitivity analysis. 

Fig. 3. Results for null generation in the sensitivity analysis. 



3.2 Distance Analysis 

Figure 4 shows how the distance increases when the number of variables is 
increased. For instance, if the number of variables increases by 40%, the 
distance rises by around 20%. In this way, for a CSP with 40 variables, if a 
modification produces a new CSP with 16 new variables, on average it could 
produce a distance of 8 variables with different values between the solutions. 
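
The arithmetic behind this example can be written out as follows. The symbols $n$ (initial number of variables), $\Delta n$ (number of added variables) and $d$ (expected distance) are introduced here only for illustration, and the slope of one half is read off the two figures quoted in the text; it is not a fitted value reported by the authors.

```latex
\[
\frac{\Delta n}{n} = \frac{16}{40} = 40\%, \qquad
\frac{d}{n} \approx 20\% \;\Rightarrow\; d \approx 0.20 \times 40 = 8, \qquad
\text{so } d \approx \tfrac{1}{2}\,\Delta n .
\]
```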




Fig. 4. Behavior of distance when varying the number of variables. 



Figure 5 presents the behavior of the distance when varying the number of 
variables whose domain was modified. In this case there is a larger increase 
than that reported when varying the number of variables. Figure 6 illustrates 
the increase in distance when varying the number of constraints. 

By observing the last three figures, we can conclude that, in general, the 
distance behaves linearly in all three analyzed cases. This makes it possible 
to estimate, in terms of the number of varying elements, the distance that will 
be produced between solutions. Let us recall, nevertheless, that these results 
were produced using the RAP. It would be interesting to experiment with other 
problems modeled as a DCSP, and also with other algorithms. The fact that the 
relationship is linear for LocalChanges implies that any other algorithm that 
improves on it will have to generate a relationship with a smaller slope or, in 
the best case, a constant relationship. This empirical relationship could be 
used as a heuristic within an algorithm such as A*, for example, which would 
make it possible to minimize the distance. 

Fig. 5. Behavior of distance when varying the domain size. 

Fig. 6. Behavior of distance when varying the number of constraints 
(horizontal axis: modified constraints, %). 



4 Conclusions 

This paper has presented an empirical stability analysis of four algorithms 
for solving DCSPs. Results show that no algorithm obtains the best performance 
for all three comparison measures. In the case of efficiency, the iterative 
repair of Lopt and LocalChanges produces better results than the Backjumping 
technique integrated in NogoodBuilder and Nopt. However, Backjumping produces 
less distance between solutions than iterative repair. In particular, it has 
been observed that the desired features in an algorithm for solving DCSPs are 
solution reuse for relaxations, and reasoning reuse in combination with 
iterative repair for constraints. It has also been verified that an algorithm 
that produces good stability does not always produce less distance between 
solutions. In this way, the null generation makes it possible to select, 
according to the application domain, which algorithm is more convenient, 
depending on whether more stability is required during the entire solving 
process or just on a few occasions. Based on the distance analysis, it has been 
empirically shown that there is a linear relationship between the increase in 
distance and the variation in the number of variables, the domain size and the 
number of constraints. This relationship helps to determine, on average, the 
distance between solutions, which can improve the decision making process if 
the modifications are performed manually. The analysis was performed over 
randomly generated instances of the RAP modeled as a DCSP. Future extensions to 
this work may include performing the sensitivity analysis for other similar 
problems that handle different types of constraints or domain sizes; performing 
a detailed analysis to establish lower and upper bounds on the distance 
behavior; and, additionally, developing a theoretical analysis for more general 
problem types and modifications. 

Abstract. Classical constraint satisfaction problems (CSPs) provide an 
expressive formalism for describing and solving many real-world problems. 
However, classical CSPs prove to be restrictive in situations where 
uncertainty, fuzziness, probability or optimisation are intrinsic. Soft 
constraints alleviate many of the restrictions which classical constraint 
satisfaction imposes; in particular, soft constraints provide a basis for 
capturing notions such as vagueness, uncertainty and cost into the CSP model. 
We focus on the semiring-based approach to soft constraints. In this paper we 
present a new evaluation-based scheme for implementing meta-constraints, which 
can be applied to any existing implementation to improve its run-time 
performance. 



1 Introduction 

Classical constraint satisfaction problems (CSPs) provide an expressive formalism for 
stating and solving many real-world problems. CSPs allow us to express relations over 
variables in a problem, which can be seen as declaring the allowed combinations of in- 
stantiated values for variables. In this way we can declaratively state problems and pass 
the burden of finding solutions to these problems onto the constraint solver. However, 
classical CSPs prove to be restrictive in problems where uncertainty, fuzziness, 
probability or optimisation are intrinsic. Soft constraints alleviate many of the 
restrictions which classical constraint satisfaction imposes. 

We introduce the term semiring meta-constraints (constraints which depend on other 
constraints) in this paper as a useful means of referring to a class of constraints defined 
in the literature. We advocate the use of these meta-constraints to reduce the complexity 
of defining algorithms to efficiently solve soft constraint problems without relying on 
local consistency techniques, which severely limit the scope of a soft constraint solver. 
Such algorithms have been defined in the system given in [2], but they are 
unfortunately highly inefficient due to the representation of meta-constraints used. 

In this paper we discuss the specification and implementation of semiring 
meta-constraints. We show how the currently dominant compilation-based 
conceptualisation of meta-constraints is fundamentally flawed, as it results in 
any algorithm which utilises these useful abstractions having exponential time 
and space complexity. We show how these problems can be very simply resolved by 
instead adopting an evaluation-based approach to specifying and implementing 
these constraints. Therefore, the primary contribution of this paper is the new 
evaluation-based scheme for implementing meta-constraints, which can be applied 
to any existing implementation to alleviate problems of unnecessary space usage. 

The paper is organised as follows. Section 2 presents the semiring framework of 
Bistarelli et al. [3] and illustrates how it can unify many disparate models of constraint sat- 
isfaction by using a semiring structure to represent consistency levels and the operations 
needed to combine and compare those levels. We describe semiring meta-constraints and 
provide some pedagogical examples of evaluation-based meta-constraints. Section 3 
reviews the existing implementations of the semiring framework. Section 4 presents 
our scheme for the implementation of evaluation-based meta-constraints and Section 5 
presents some basic results of the runtime efficiency that can be expected for evaluating 
these constraints over problems of different tightness. Finally, Section 6 summarises the 
ideas presented in this paper. 

2 Semiring Framework 

The semiring framework for constraint satisfaction is based on one key insight, that is, 
a semiring (a set together with two binary operators which satisfy certain properties) 
is all that is needed to describe many constraint satisfaction schemes. The semiring set 
provides the levels of consistency which can be interpreted as cost, degrees of preference, 
probabilities or any other criteria consistent with the requirements of the framework. The 
two operations then allow us to combine ($\times$) and to compare ($+$) consistency 
levels from this set. 

In the interest of brevity we will restrict our discussion of the semiring framework 
under the functional formulation [4] to a brief statement of the basic ideas involved. 
For a more detailed and rigorous treatment of the subject the reader is referred to the 
literature [1,3,4], where many key results pertaining to this framework are proven. 

2.1 Semirings 

A c-semiring (constraint-semiring) is a tuple 
$\langle A, +, \times, \mathbf{0}, \mathbf{1} \rangle$ such that: 

- $A$ is the set of all consistency values and $\mathbf{0}, \mathbf{1} \in A$; 
$\mathbf{0}$ is the lowest consistency value and $\mathbf{1}$ is the highest 
consistency value; 

- $+$, the additive operator, is a closed, commutative, associative and 
idempotent operation such that $\mathbf{1}$ is its absorbing element and 
$\mathbf{0}$ is its unit element; 

- $\times$, the multiplicative operator, is a closed and associative operation 
such that $\mathbf{0}$ is its absorbing element, $\mathbf{1}$ is its unit 
element and $\times$ distributes over $+$. 

The c-semirings for some typical instances of the semiring framework are: 

- Crisp CSP: {{false, true}, \/ , A, false, true); 

- Fuzzy CSP: ({x | x G [0, 1]}, max, min, 0, 1); 

- Probabilistic CSP: ({x | x G [0, 1]}, max, x , 0, 1); 

- Weighted CSP: {TZ'^,min, -f, -boo, 0); 

- Set-based CSP: (p(A), U, n, 0, A). 
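The instances listed above can be written down directly as data. The following is a minimal Python sketch, not taken from any of the implementations discussed in this paper: the class name `Semiring`, its field names and the helper `check_identities` are hypothetical, and the carrier set is left implicit in the value type. It spot-checks the unit, absorbing and idempotence laws of the definition on a few sample values.

```python
import math
from dataclasses import dataclass
from typing import Callable, Generic, TypeVar

A = TypeVar("A")


@dataclass(frozen=True)
class Semiring(Generic[A]):
    """A c-semiring <A, +, x, 0, 1>; the carrier set A is implicit in the value type."""
    plus: Callable[[A, A], A]   # additive operator: compares consistency levels
    times: Callable[[A, A], A]  # multiplicative operator: combines consistency levels
    zero: A                     # lowest level: unit of plus, absorbing for times
    one: A                      # highest level: unit of times, absorbing for plus


# Four of the instances listed above (names are illustrative only).
CRISP = Semiring(lambda a, b: a or b, lambda a, b: a and b, False, True)
FUZZY = Semiring(max, min, 0.0, 1.0)
PROBABILISTIC = Semiring(max, lambda a, b: a * b, 0.0, 1.0)
WEIGHTED = Semiring(min, lambda a, b: a + b, math.inf, 0.0)


def check_identities(s, samples):
    """Spot-check the unit, absorbing and idempotence laws on sample values."""
    for a in samples:
        assert s.plus(a, s.zero) == a        # 0 is the unit of +
        assert s.plus(a, s.one) == s.one     # 1 absorbs under +
        assert s.times(a, s.one) == a        # 1 is the unit of x
        assert s.times(a, s.zero) == s.zero  # 0 absorbs under x
        assert s.plus(a, a) == a             # + is idempotent
    return True
```

Note that in the weighted instance the semiring $0$ is $+\infty$ (the worst cost) and the semiring $1$ is the real number 0 (no cost), so "combining" two costs adds them and "comparing" them keeps the smaller.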
2.2 Constraint Problems

Given a semiring $S = \langle A, +, \times, 0, 1 \rangle$ and an ordered set of variables $V$
over a finite domain $D$, a constraint is a function which, given an assignment
$\eta : V \to D$ of the variables, returns a value of the semiring. Using this notation we
define $\mathcal{U} = \eta \to A$ as the set of all possible constraints that can be built
starting from $S$, $D$ and $V$.

In this functional formulation of the semiring framework each constraint is a function
(as defined in [4]) and not a pair (as defined in [3]). Each constraint function involves all
the variables in $V$, but it depends on the assignment of only a finite subset of them. For
example, a binary constraint $c_{x,y}$ over variables $x$ and $y$ is a function
$c_{x,y} : (V \to D) \to A$, but it depends only on the assignment of variables
$\{x, y\} \subseteq V$. This subset is known as the support of the constraint. The assignment
of a domain value $d$ to a variable $v$ as a modification to a particular instantiation $\eta$
is denoted by $\eta[v := d]$.

A soft constraint satisfaction problem is a pair $\langle C, con \rangle$ where
$con \subseteq V$ and $C$ is a set of constraints; $con$ is the set of variables of interest
for the set of constraints.

2.3 Semiring Meta-constraints 

In this paper we introduce the term semiring meta-constraints (or simply
meta-constraints) as a convenient means of referring to constraint functions defined over
other constraints in the semiring framework. Several classes of meta-constraints have been
defined in the literature, including combination constraints, projection constraints, solution
constraints and blevel (best-level) constraints [1,3,4]. In this paper we will focus on
combination and projection meta-constraints as both solution and blevel meta-constraints are
defined in terms of these primitives.

Combination Meta-Constraints. Given the set $\mathcal{U}$, the combination function
$\bigotimes$ is defined as $(\bigotimes C)\eta = \prod_{c \in C} c\eta$. This function takes a
set of constraints and returns a combination meta-constraint. This definition is the
straightforward extension of the $\otimes$ function [4] to sets of constraints.

Informally, a combination meta-constraint represents the constraint which is equivalent
to all of the constraints in $C$ combined together. This is a very useful abstraction
as it allows us to perform all reasoning over single constraints instead of cumbersome
sets of constraints. Evaluating a given combination meta-constraint for an instantiation
of the variables $\eta$ simply involves evaluating all of its constituent constraints under
$\eta$ and combining the individual consistency values using the semiring $\times$ operator.

Projection Meta-Constraints. Given a constraint $c \in \mathcal{U}$ and a variable
$v \in supp(c)$, the projection function $\Downarrow$ is defined as
$(c \Downarrow_{(supp(c) - \{v\})})\eta = \sum_{d \in D} c\eta[v := d]$. This
function takes a constraint and set of variables as parameters and returns the constraint
which is equivalent to the original constraint with its support reduced to the specified
set of variables.

Informally, projecting a constraint $c$ over the set of variables $(supp(c) - \{v\})$ returns
a constraint $c'$ which is equivalent to $c$ with the variable $v$ removed from the support.
This is done by evaluating $c\eta[v := d]$ (for the instantiation of interest) for all domain
values $d$ in the domain of $v$, and returning the sum of all of these individual consistency
values using the semiring additive operator $+$. Effectively then, the value returned from
evaluating $c'\eta$ is the best consistency value possible for the instantiation of variables
$\eta$ if we can choose any value for the instantiation of $v$.

2.4 Example Soft Constraint Problem 

In this section we present an example soft constraint problem, defined over the semiring
$S = \langle \mathcal{R}^+, \min, +, +\infty, 0 \rangle$ which describes Weighted CSPs. In this
problem we have two variables, $x$ and $y$, defined over the domain $D = \{1, 2, 3, 4, 5\}$.
In a problem of this type we have a set of cost functions defined over the variables of
interest; each individual cost function describes the cost of one specific section of a
configuration under a particular instantiation of the variables. For simplicity we define a
generic cost function $cost(a, n) = (n - a)^2$ to enable us to easily demonstrate the ideas in
question.

In particular, we will define three constraints denoted $c_x$, $c_y$ and $c_{x,y}$, defined as
follows:

    $c_x\eta = cost(2, x)$,
    $c_y\eta = cost(4, y)$,
    $c_{x,y}\eta = cost(1, y - x)$.

Unary constraints $c_x$ and $c_y$ are intended to represent the costs associated with an
instantiation deviating from an ideal value. For instance, the ideal value for $x$ according
to $c_x$ is 2 and any instantiation where $x$ is not set to this value will be penalised
in proportion to the square of its distance from this value. Binary constraint $c_{x,y}$ is
used to illustrate the idea that we can easily model complex inter-relationships between
variable instantiations.

The constraint problem in this example is then given by
$P = \langle \{c_x, c_y, c_{x,y}\}, \{x, y\} \rangle$.
To allow us to demonstrate the ideas of evaluation-based meta-constraints introduced in
this paper we will give examples of combination and projection meta-constraints over
this problem.



Combination. In this example we demonstrate the evaluation of a combination meta- 
constraint. To evaluate a combination meta-constraint for a particular instantiation we 
must evaluate each of the constituent constraints under the instantiation in question and 
find the product of these values using the semiring multiplicative operator. 

In particular, we demonstrate the evaluation of the combination of the constraints
$c_x$, $c_y$ and $c_{x,y}$, $\bigotimes\{c_x, c_y, c_{x,y}\}$, under the instantiation where
$x$ has the value 1 and $y$ has the value 5 ($\eta[x := 1, y := 5]$), i.e.,

$(\bigotimes\{c_x, c_y, c_{x,y}\})\eta[x := 1, y := 5] =$

    $c_x\eta[x := 1, y := 5] = cost(2, 1) = 1$
    $\times_S$
    $c_y\eta[x := 1, y := 5] = cost(4, 5) = 1$
    $\times_S$
    $c_{x,y}\eta[x := 1, y := 5] = cost(1, 4) = 9$.

As the semiring multiplicative operator in this case is addition over reals, the overall cost
associated with this instantiation of the variables $\eta[x := 1, y := 5]$ is 11.

Projection. In this example we demonstrate the evaluation of the projection of the
constraint $c_{x,y}$ over the set $\{x\}$, i.e. the meta-constraint where we remove $y$ from
the support of $c_{x,y}$. Specifically then, we will evaluate $c_{x,y} \Downarrow_{\{x\}}$
under the instantiation $\eta[x := 1, y := 5]$, i.e., the instantiation where $x$ has the
value 1 and $y$ has the value 5.

To evaluate a projection meta-constraint for a particular instantiation, we must evaluate
the constraint in question for all domain values of variables which have been removed
from its support. We then find the sum of all of the individual consistency values using
the semiring additive operator, $+_S$, i.e.,

$(c_{x,y} \Downarrow_{\{x\}})\eta[x := 1, y := 5] =$

    $c_{x,y}\eta[x := 1, y := 1] = cost(1, 0) = 1$
    $+_S$
    $c_{x,y}\eta[x := 1, y := 2] = cost(1, 1) = 0$
    $+_S$
    $c_{x,y}\eta[x := 1, y := 3] = cost(1, 2) = 1$
    $+_S$
    $c_{x,y}\eta[x := 1, y := 4] = cost(1, 3) = 4$
    $+_S$
    $c_{x,y}\eta[x := 1, y := 5] = cost(1, 4) = 9$.

As the semiring additive operator for the weighted semiring is the min function over
reals, the result of evaluating this constraint is 0.

One important idea illustrated in this example is the concept of the support of a
constraint. In this example, the support of $c_{x,y} \Downarrow_{\{x\}}$ is $\{x\}$. This
means that this constraint depends only on the assignment of values to variable $x$. This is
demonstrated in the example when we evaluate the constraint $c_{x,y} \Downarrow_{\{x\}}$
under the instantiation $\eta[x := 1, y := 5]$, but we evaluate the constraint that it depends
on, $c_{x,y}$, for all instantiations where $x := 1$ and $y := d$.
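The two worked examples of this section can be replayed in a few lines of Python. This is only an illustrative sketch under stated assumptions: assignments are represented as dictionaries, constraints as plain functions, and the helper names `combine` and `project_out` are hypothetical rather than part of any system described in this paper. Both helpers return closures that evaluate their constituent constraints on demand.

```python
import math
from functools import reduce

DOMAIN = [1, 2, 3, 4, 5]

# Weighted semiring: plus is min, times is addition, 0 is +inf, 1 is 0.
plus, times, ZERO, ONE = min, (lambda a, b: a + b), math.inf, 0


def cost(a, n):
    """Generic cost function of the example: squared distance of n from a."""
    return (n - a) ** 2


# Constraints are functions from an assignment (a dict) to a semiring value.
c_x = lambda eta: cost(2, eta["x"])
c_y = lambda eta: cost(4, eta["y"])
c_xy = lambda eta: cost(1, eta["y"] - eta["x"])


def combine(constraints):
    """Combination meta-constraint: the semiring product of all constituents."""
    return lambda eta: reduce(times, (c(eta) for c in constraints), ONE)


def project_out(c, v):
    """Projection meta-constraint: removes variable v from the support of c
    by taking the semiring sum over every domain value of v."""
    return lambda eta: reduce(plus, (c({**eta, v: d}) for d in DOMAIN), ZERO)


eta = {"x": 1, "y": 5}
combined = combine([c_x, c_y, c_xy])(eta)   # 1 + 1 + 9
projected = project_out(c_xy, "y")(eta)     # min(1, 0, 1, 4, 9)
print(combined, projected)                  # prints: 11 0
```

Because `project_out(c_xy, "y")` overwrites `y` before evaluating `c_xy`, its result is the same for every value of `y` in the instantiation, which is exactly the statement that its support is $\{x\}$.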

3 Existing Implementations 

In this section we discuss the published implementations of the semiring framework. 
There are a number of issues with these implementations: these range from limitations 
on the types of semirings that can be handled to runtime efficiency issues. 

3.1 clp(FD,s) 

In [7] the authors present an extension of the clp(FD) [5] system, clp(FD,s). This 
system provides an efficient means of solving constraint problems defined over a sub- 
set of the semirings in the semiring framework. However, no implementation of the 
combination and projection meta-constraints is provided. 

In this system, the authors explicitly restrict the scope of the solver to those semirings
in which $\times$ is idempotent, and hence do not support the full generality of the semiring
framework. Many of the techniques used to gain efficiency utilise properties only present
in semirings where the multiplicative operation is idempotent. This may seem like a
reasonable compromise; however, this design decision prevents problems defined over
the Probabilistic and Weighted semirings from being solved on this system.



3.2 SoftCHR 

In [2] the authors present an implementation of the semiring framework based on 
CHRs [6]. CHRs allow for the simplification and propagation of constraints and have 
been successfully deployed in dozens of projects to implement various crisp solvers. 
However, as propagation cannot be applied to instantiations where the multiplicative 
operation is not idempotent, the usefulness of CHRs is limited in this context. 

However, the system does provide several algorithms which can be used over all 
instances of the semiring framework, including Branch and Bound algorithms with both 
variable and constraint labelling, as well as a Dynamic Programming search algorithm. 
Unfortunately, the implementation of meta-constraints in this system severely limits the 
utility of these algorithms. 

In this system all meta-constraints are represented extensionally as a list of
tuple-consistency pairs using the compilation-based scheme (see Section 4). Savings in space
usage are attained by not storing tuples with consistency of zero. However, in general,
a k-ary meta-constraint will require exponential time and space to compile and store.
Moreover, many of the more complex operations for this system, such as the
dynamic-programming solver, use this operation heavily, ensuring that these operations
require exponential time and space also.

In the next section we present a simple method to solve this problem of exponential
time and storage. Hopefully, this can be integrated into this system, which may allow
the useful general purpose algorithms provided in the system to be applied to non-trivial
problems.



4 Implementing Meta-constraints 

While a large amount of work has been published on the theoretical aspects of soft con- 
straints, apart from the two implementations mentioned in Section 3, very little has been 
published on the subject of practical implementation of soft constraints. We advocate 
the use of semiring meta-constraints as a useful abstraction to reduce the complexity of 
developing efficient algorithms to solve soft constraint problems in general. 

However, meta-constraints are currently not viable as they are both specified and
implemented using a compilation-based approach. By compilation-based we mean that
when a meta-constraint function is created a lookup table of all possible input values
and their corresponding output results is computed and stored. This approach is
extraordinarily wasteful of both computing time and space. For instance, if we had a binary
meta-constraint function over variables with domains of size twenty, we would need
to compile a lookup table with $20^2$ entries. In general, if we have a compilation-based
meta-constraint function over a set of variables $V$ with domain $D$, then we will require
Table 1. Compilation-based function $f(x, y) = x^2 + y^3$ defined over domain $\{1, \ldots, 5\}$.

    x      y      x^2 + y^3
    1      1      2
    1      2      9
    1      3      28
    ...    ...    ...
    5      5      150



a lookup table with $|D|^{|V|}$ entries to fully encode the function. This means we need
exponential time and space to construct these functions.

For example, consider the function $f(x, y)$ shown in Table 1. In this example we
show a function which is composed of two functions over different variables with their
respective results added together. This is analogous to a combination meta-constraint,
which is a function composed of a number of separate functions over different variables
with their results combined together using some simple operation. The variables in this
function, $x$ and $y$, are defined over the domain $D = \{1, \ldots, 5\}$. Even with this tiny
domain it is necessary to contract the lookup table which we are using for explanatory
purposes.
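To make the cost of compilation concrete, the following Python sketch tabulates the function of Table 1 and contrasts it with evaluating the same function on demand. The helper name `compile_table` and the dictionary representation of assignments are assumptions made for illustration only.

```python
from itertools import product


def compile_table(f, variables, domain):
    """Compilation-based scheme: tabulate f for every assignment of the
    variables, giving len(domain) ** len(variables) entries."""
    return {values: f(dict(zip(variables, values)))
            for values in product(domain, repeat=len(variables))}


# The function of Table 1, written as a function of an assignment.
f = lambda eta: eta["x"] ** 2 + eta["y"] ** 3

table = compile_table(f, ("x", "y"), range(1, 6))
print(len(table))        # prints: 25
print(table[(5, 5)])     # prints: 150

# Evaluation-based scheme: nothing is stored, the value is computed when asked for.
print(f({"x": 5, "y": 5}))   # prints: 150
```

With two variables the table has 25 entries; each additional variable over the same domain multiplies its size by five, whereas the evaluation-based form stores only the function itself.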



An Alternative Approach. A far more economical and simpler method of implementing
meta-constraint functions is to simply store the original constraint functions that are
involved and evaluate these as required with the instantiation of interest. In this way we
can create a new meta-constraint function in constant time and with space linear in the
number of constraints involved. This is the evaluation-based method of implementing
meta-constraints.

One possible criticism of this evaluation-based approach is that there may be
situations where we need to know the value of all possible instantiations for a particular
meta-constraint, and furthermore, we may need to find out the value of a particular
instantiation many times. However, these situations are hard to imagine and still do not
warrant the storage of all possible instantiations. If we wish to find the value of every
possible instantiation for a given meta-constraint we can simply iterate through all
possible instantiations and evaluate the constraint for each one. If the value of an
instantiation will be needed many times, it is the responsibility of the specific algorithm
which requires this property to determine if it is worthwhile caching the value, not the
function which calculates it.

Furthermore, if we make the not-unreasonable assumption that in the majority of 
constraint processing algorithms we define we will want to find the value of the least 
number of instantiations possible, the compilation scheme is highly undesirable. To sum 
up, any algorithm that we define in terms of compilation-based meta-constraints will 
have exponential time and space complexity, regardless of the semantics of the algorithm 
itself. 







J. Kelleher and B. O’Sullivan 



Algorithm 1 CombinationEvaluate($\eta$)

    $\alpha \leftarrow 1$
    for all $c \in C$ do
        $\alpha \leftarrow \alpha \times c\eta$
        if $\alpha = 0$ then
            return $0$
        end if
    end for
    return $\alpha$



Combination Evaluation. Combination meta-constraints are an extremely useful
abstraction as they allow us to treat a set of constraints as a single constraint. Thus, any
reasoning or operations that deal with constraints can be defined over a single constraint
as we can refer to any set of constraints by their combination as a single constraint. This
simplifies both theoretical and practical work with constraints.

Combination is a universal operation in constraint satisfaction. Any form of constraint
processing which deals with distinct sets of constraints can be expressed in terms
of this operation. Therefore any improvements we make in the time or space efficiency
of this operation will have knock-on effects on any other more sophisticated constraint
processing that we do.

To evaluate a combination meta-constraint defined over the set of constraints $C$ at
runtime for a given instantiation $\eta$ we use Algorithm 1. In this algorithm we simply
iterate through all of the constraints in $C$ and evaluate each one under the instantiation
in question. To prevent unnecessary computation, we use the fact that $0$ is the absorbing
element of the $\times$ operation. In this way, we know that if any single function evaluates
to $0$ then the entire combination constraint will also evaluate to $0$ and we can therefore
immediately return $0$.

As this lazy-evaluation leverages the full generality of the semiring framework, it 
applies to all instances. For example, in the crisp semiring, this optimisation reduces to 
the lazy evaluation of the boolean AND operation; over the fuzzy semiring, it reduces 
to the lazy evaluation of the min function defined over the interval [0, 1]. 
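Algorithm 1 translates almost line for line into executable code. The sketch below is one possible Python rendering, assuming that constraints are functions of an assignment and that the semiring is passed in as its `times` operator and its `zero` and `one` elements; the function name and the extra evaluation counter in the return value are illustrative additions, not part of the algorithm as stated.

```python
def combination_evaluate(constraints, eta, times, zero, one):
    """Evaluate the combination of `constraints` under assignment `eta`.

    Returns a pair (value, evaluated), where `evaluated` counts how many
    constituent constraints were actually evaluated before returning.
    """
    alpha = one
    evaluated = 0
    for c in constraints:
        alpha = times(alpha, c(eta))
        evaluated += 1
        if alpha == zero:
            # zero is absorbing for times, so the remaining constraints
            # cannot change the result and need not be evaluated.
            return zero, evaluated
    return alpha, evaluated


# Fuzzy semiring: times is min, zero is 0.0, one is 1.0.
fuzzy = [lambda eta: 0.8, lambda eta: 0.0, lambda eta: 0.5]
print(combination_evaluate(fuzzy, {}, min, 0.0, 1.0))   # prints: (0.0, 2)
```

In the fuzzy call above the third constraint is never evaluated, because the second one already yields the semiring $0$.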

5 Experimental Evaluation 

In this section we present some basic foundational results on semiring meta-constraints. 
In particular, in Section 5.1 we discuss results for applying the lazy evaluation pre- 
sented in Section 4 and in Section 5.2, we compare the performance of a Branch and 
Bound search for the set of best solutions using compilation-based and evaluation-based 
combination constraints. 

In both sections we use random soft constraint problems. To achieve this we follow
the methodology adopted in [8], in which binary fuzzy CSPs are generated with four
specific properties: the number of variables n, the number of domain values per variable
m, the density d and the tightness t. The tightness of a problem is defined as the ratio of the
number of instantiations which evaluate to semiring 0 over the total number of possible
instantiations. The remaining instantiations are then assigned a consistency value from

(a) Lazy Evaluation. (b) Branch and Bound search.

Fig. 1. Experimental results for problems of varying tightness. 



the interval (0, 1] which is randomly generated with a uniform distribution [8]. To ensure 
that anomalous results are not reported, we performed ten-fold cross validation over the 
results obtained, i.e., we generated ten random problems with the required specifications 
and report the average result over all of these problems. 



5.1 Lazy Evaluation 

In the problems generated for this experiment, the number of variables is fixed at 5, 
density at 1.0 and the number of domain values is also 5. Results reported are the number 
of constraint evaluations using lazy evaluation divided by the number of constraint 
evaluations where no lazy evaluation is used. This can be seen as the gain in running 
time obtained by applying the lazy evaluation. 

Specifically, for each set of constraint problems generated, we evaluate the
meta-constraint representing the combination of all constraints in the problem under all
possible instantiations. Results reported are given as the ratio of the number of constraint
evaluations required using the lazy evaluation method in Algorithm 1 and the number
of constraint evaluations required without using this method, which is a constant for a
problem with a given specification. These results are shown in Figure 1(a).

If we examine Figure 1(a), we see that at low tightness levels (i.e., where the number 
of instantiations which evaluate to 0 is small), the lazy evaluation has little or no effect. 
However, as the tightness of the problems increases, the likelihood of the lazy evaluation 
coming into effect also increases, and has a significant effect on the average time required 
to evaluate a combination constraint. 
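The measurement described in this section can be approximated with a short script. The sketch below is not the code used to produce Figure 1(a); it is a hedged reconstruction under several assumptions: the generator `random_fuzzy_problem` is a simplified stand-in for the methodology of [8] that applies the tightness to each binary constraint separately, density is fixed at 1.0 (one constraint per pair of variables), and all function names are hypothetical.

```python
import random
from itertools import combinations, product


def random_fuzzy_problem(n, m, tightness, rng):
    """One binary fuzzy constraint per pair of variables (density 1.0).

    A fraction `tightness` of each constraint's tuples is given value 0.0;
    the remaining tuples get a uniformly random value in (0, 1].
    """
    problem = []
    for i, j in combinations(range(n), 2):
        tuples = list(product(range(m), repeat=2))
        zeros = set(rng.sample(tuples, round(tightness * len(tuples))))
        table = {t: 0.0 if t in zeros else 1.0 - rng.random() for t in tuples}
        problem.append((i, j, table))
    return problem


def lazy_ratio(problem, n, m):
    """Constraint evaluations with early exit divided by evaluations without."""
    lazy = full = 0
    for eta in product(range(m), repeat=n):
        full += len(problem)
        for i, j, table in problem:
            lazy += 1
            # Over the fuzzy semiring the running min is 0 exactly when some
            # constituent constraint evaluates to 0, so we can stop here.
            if table[(eta[i], eta[j])] == 0.0:
                break
    return lazy / full


rng = random.Random(0)
for t in (0.0, 0.5, 0.9):
    ratios = [lazy_ratio(random_fuzzy_problem(5, 5, t, rng), 5, 5) for _ in range(10)]
    print(t, sum(ratios) / len(ratios))
```

At tightness 0.0 no constraint ever evaluates to 0, so the ratio is exactly 1.0; as the tightness grows the early exit fires more often and the ratio falls below 1.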



5.2 Branch and Bound Search 

For this experiment we generated random problems with 7 variables, 5 domain values
and density 1.0. It was necessary to use small numbers of variables and small
domains as the size of the compilation-based meta-constraints required for this experiment
prohibited the use of larger values. The results are obtained by counting the number of
(problem) constraint evaluations required to find the set of best solutions (i.e. the set
of instantiations for which the semiring value is highest when evaluated over the entire
problem) in a Branch and Bound algorithm which utilises combination meta-constraints.
We then compare the number of constraint evaluations required when using the current
compilation-based approach and our new evaluation-based approach.

Figure 1(b) shows that our evaluation-based approach is never outperformed by the 
compilation-based approach. This is because the compilation approach to implementing 
meta-constraints will often compile a great deal of information which is not required for 
a specific task. This is most clearly shown when the tightness of a problem is low and 
a great number of branch cuts can be performed by the algorithm. As the compilation- 
based approach exhaustively compiles each meta-constraint, there is no benefit gained 
from these branch cuts. As each variable is instantiated in the search algorithm, the 
compilation-based approach will exhaustively generate the entire cross product for this 
variable in conjunction with all of the previously instantiated variables. On the other hand,
the evaluation-based approach only evaluates constituent constraints of a combination
meta-constraint as required, and Figure 1(b) shows that great savings in the number of
constraint evaluations are obtained by utilising branch cuts.

As the tightness increases, we see that the number of constraint evaluations required 
for the compilation-based approach actually decreases. This is due to the lazy-evaluation 
shown in Algorithm 1 which we used to compute the values when compiling the com- 
bination constraints, ensuring a fair comparison of the two methodologies. This is the 
main reason for the convergence of the two methodologies: as the number of constraint 
evaluations required to compile the meta-constraint decreases, the number of constraint 
evaluations required to find the set of best solutions increases using the evaluation-based 
methodology. 

To conclude, the time required to compile any given meta-constraint outweighs the 
benefits of constant access time which are gained by this approach, and certainly does 
not warrant the inordinate amount of space required to store them. As Branch and Bound 
is a systematic and complete search algorithm we will never need to find the value of 
a particular instantiation of the variables on a given combination constraint more than 
once; it makes little sense in this case to store all instantiation valuations. 

6 Conclusions 

Classical constraint satisfaction problems (CSPs) provide an expressive formalism for 
expressing and solving many problems in a declarative fashion. Soft constraints alleviate
many of the restrictions which classical constraint satisfaction imposes. In particular,
soft constraints provide a basis for capturing notions such as vagueness, uncertainty 
and cost into the CSP model. In this paper we have focused on the semiring-based 
approach to soft constraints. Furthermore, we focused on some critical issues related to 
the implementation of semiring-based constraint solvers. We presented a new evaluation- 
based scheme for implementing meta-constraints, which can be applied to any existing 
implementation to improve its run-time performance. 
Abstract. We address the problem of integrating standard techniques
for automatic invariant generation within the context of program rea- 
soning. We propose the use of invariant patterns which enable us to 
associate common patterns of program code and specifications with in- 
variant schemas. This allows crucial decisions relating to the development 
of invariants to be delayed until a proof is attempted. Moreover, it al- 
lows patterns within the program to be exploited in patching failed proof 
attempts. 



1 Introduction 

Within the context of program reasoning, we address the problem of automat- 
ing loop invariant generation. There are two basic kinds of invariant generation 
techniques. Firstly, bottom-up analysis techniques generate inductive invariants 
by analysing program code. Secondly, top-down analysis techniques use specifi- 
cations (assertions) as the basis for generating inductive invariants. In practice, 
a third kind of analysis, what we will call proof-failure analysis, also plays a 
crucial role within invariant generation. 

We propose the use of invariant patterns as a means of achieving an effective
integration of these three kinds of analyses. An invariant pattern represents an
invariant schema together with selection criteria. We build upon proof planning
[3], a technique for automating theorem proving. In particular, we use middle- 
out reasoning [4] which supports an incremental style of invariant discovery and 
proof critics [9,11] which supports proof-failure analysis. The context for our 
work is the application of proof planning to the verification of programs written 
in SPARK [1], a programming language designed for the development of criti- 
cal software systems. SPARK is derived from Ada and includes an annotation 
language which supports flow analysis and formal proof. In §2, §3, and §4 we 
outline our general approach, while in §5, we present a detailed application. 

2 Bottom-Up Analysis 

Traditional bottom-up analysis techniques generate light-weight invariant prop- 
erties, such as relationships between loop counter variables. Such invariants are 
typically required in order to complete a proof. In addition to generating in- 
variants, our extended bottom-up analysis technique generates information that 



R. Monroy et al. (Eds.): MICAI 2004, LNAI 2972, pp. 190-201, 2004. 
© Springer- Verlag Berlin Heidelberg 2004 




Invariant Patterns for Program Reasoning



mono_dec(V, W): Means that V is a loop counter which monotonically decreases
during the execution of loop W.

mono_inc(V, W): Means that V is a loop counter which monotonically increases
during the execution of loop W.

constant(V, W): Means that V is a loop counter which is constant during the
execution of loop W.



Fig. 1. Meta predicates 



supports both top-down and proof-failure analyses. This involves identifying 
common patterns in terms of how program variables and data structures are 
used within an algorithm. Such patterns can be used in guiding the discovery 
of invariants. By way of illustration, invariant discovery involves identifying the 
relationship between “work done” and “work still to do” during a computa- 
tion. Within the context of array based programs, this relationship typically 
corresponds to partitioning an array, where partition boundaries are defined in 
terms of loop counter variables. Knowing how counter variables change during 
a computation provides guidance in determining the structure of invariants. We 
explicitly represent these changes by means of predicates as defined in figure 1. 
Making these notions explicit means that the information can be exploited by our 
top-down and failure-analysis techniques, as will be illustrated later. Note that 
where loops are nested an outer-loop counter will typically remain unchanged 
during the execution of an inner-loop. This notion is expressed by the predicate 
constant. It is envisaged that this set of predicates will evolve as new patterns 
between algorithms and invariants are identified. 
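As a rough illustration of what the meta predicates of figure 1 record, the following Python sketch classifies a variable from the sequence of values it takes on successive iterations of a loop. This is a deliberately simplified, trace-based stand-in: the analysis described in this section works on program code rather than on execution traces, the function name `classify_counter` is hypothetical, and "monotonically" is read here as non-strict.

```python
def classify_counter(values):
    """Classify the values a variable takes on successive iterations of one loop.

    Returns the name of the matching meta predicate, or None if no predicate
    of figure 1 applies.
    """
    steps = list(zip(values, values[1:]))
    if all(a == b for a, b in steps):
        return "constant"
    if all(a <= b for a, b in steps):
        return "mono_inc"
    if all(a >= b for a, b in steps):
        return "mono_dec"
    return None


print(classify_counter([0, 1, 2, 3]))   # prints: mono_inc
print(classify_counter([7, 7, 7]))      # prints: constant
```

An outer-loop counter observed over the iterations of an inner loop would be reported as `constant`, matching the remark above about nested loops.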

3 Top-Down Analysis 

Our top-down analysis technique is novel in that it generates schematic invari- 
ants. To illustrate the general mechanism, consider the following pattern of post- 
condition for an array based program: 

$\forall q : int.\ ((l \leq q) \wedge (q \leq u)) \to P(q)$     (1)

Note that $l$ and $u$ denote the lower and upper bounds on $q$ respectively. Typically
these bounds will correspond to the array bounds and the predicate $P(q)$ will define
a property of the array. Weakening a postcondition corresponding to (1) can
be achieved by restricting the range of $q$. We call this pattern of invariant range
restriction. In order to tailor this pattern to a particular algorithm we use the
information generated via bottom-up analysis. Let us assume that the predicate
$P$ specifies a property of an array $t$, and moreover that $t$ is partitioned with respect
to a loop counter $i$ with accessible range $l$ to $i$. If $i$ is monotonically increasing
then this suggests that (1) should be weakened by replacing $u$, the upper bound
on $q$. Determining the identity of the replacement term is a key problem. The
conventional strategy involves generate and test, where test involves a theorem
prover. Here we propose the use of meta-variables, which allows us to delay the
choice until we plan the proof. In terms of (1), this gives an invariant schema of
the form:

$\forall q : int.\ ((l \leq q) \wedge (q \leq F_1(i))) \to P(q)$

Note that $F_1$ denotes a second-order meta-variable. If the accessible range associated
with $i$ was $i$ to $u$ and $i$ was monotonically decreasing, then the $l$ would have
been replaced by $F_1(i)$. Within the context of nested loops, invariant schemas
are generated for an outer-loop before its inner-loop. An inner-loop will inherit
the invariant schemas generated for its outer-loop.

4 Proof-Failure Analysis 

Given a failed proof attempt, a common theorem proving strategy is to conjoin 
the failed goal onto the original conjecture (invariant) and attempt the proof 
again. We extend this strategy by introducing two alternative generalization 
steps. We describe each generalization in terms of a proof critic. 

4.1 Range Generalization Critic 

Using a “picture” notation, consider the following array:

[Picture: an array with the positions L, T_1, T_2 and U labelled]

Note that the elements indexed by $T_1$ and $T_2$ are adjacent while $L$ and $U$ denote
the lower and upper bounds on the array respectively. When proving a relationship
between adjacent elements it is often the case that one needs to consider a
range of elements rather than just the individuals, i.e.

[Picture: the same array with the positions L, T_1, T_2 and U labelled]

Considering a range of elements provides a stronger invariant (inductive)
hypothesis. Here we represent this observation as a proof critic. The preconditions
for what we call the range generalization critic are as follows¹:

1. A goal is unprovable within the current proof context and matches the
following pattern:

    $ele(A, [T_1])\ Rel\ ele(A, [T_2])$     (blocked)

where $A$ denotes an array, terms $T_1$ and $T_2$ index adjacent elements within
the array $A$ and $Rel$ denotes a transitive relation.

2. Terms $T_1$ and $T_2$ contain a counter variable in common.

¹ Note that $ele(X, [Y])$ denotes the value of element $Y$ within array $X$.

Note that precondition 2 exploits the meta predicates outlined in §2. The
associated patch involves generalizing with respect to both $T_1$ and $T_2$, i.e.

$\forall X : int.\ ((L \leq X) \wedge (X \leq T_1)) \to
    (\forall Y : int.\ ((T_2 \leq Y) \wedge (Y \leq U)) \to ele(A, [X])\ Rel\ ele(A, [Y]))$

This generalized goal represents an auxiliary invariant which is then conjoined
onto the original invariant. We envisage situations where a weaker generalization
may be appropriate. For instance, if $T_1$ denotes a constant then only $T_2$ would
be generalized, and vice versa.

4.2 Difference Generalization Critic 

Our second generalization critic builds upon the rippling proof plan. Rippling 
is a rewriting technique in which annotations are used to guide the selection 
of rewrite rules. Selection is based upon a difference reduction heuristic. The 
difference between a goal and a hypothesis are annotated, where the annotations 
are called wave-fronts. Annotated rewrite rules, known as wave-rules, are used 
to reduce the differences between goal and hypothesis. Rippling is successful if a 
match between the goal and the hypothesis is made possible. This matching is 
known as fertilization. A completely formal account of the ripple method can be 
found in [2,5]. Our second generalization critic is motivated by the observation 
that an unproven goal resulting from a successful fertilization often requires a 
subsequent ripple proof. This in turn involves annotating the differences between 
the post-fertilization goal and another hypothesis (invariant) within the proof 
context. This change of the “rippling focus” breaks down if the proof context 
is missing the hypothesis (invariant) that is necessary for the ripple proof to 
proceed. The patch uses the available wave-rules (background theory) to guide 
the discovery of the missing hypothesis (invariant), i.e. we look for wave-rule 
matches that fail because of missing wave-front annotations within the goal. 
The preconditions to the critic are as follows: 

1. A post-fertilization goal is unprovable within the current proof context, i.e.

$$\underbrace{f(g(c(a, b)))}_{\text{blocked}}$$

2. There exists a wave-rule that matches modulo missing wave-front
annotations, i.e.

$$g(\boxed{c(\underline{a}, b)}) \Rightarrow \boxed{h(\underline{g(a)})}$$

3. Application of the wave-rule would progress the proof planning.

Note that boxes are used here to represent wave-front annotations, with the
unannotated parts inside a box underlined. Here annotations
are missing from the goal, preventing the application of the wave-rule. With
regard to precondition 2, a criterion for evaluating the closeness of a near-miss
would be necessary in order to rank candidate wave-rules. Note that precondition
3 is not essential, but further constrains the search for an auxiliary invariant by
looking ahead into the proof planning. The associated patch involves eliminating
the terms within the unproven goal that correspond to the missing wave-front
annotations. In the general case this gives $f(g(a))$. This modified formula
represents an auxiliary invariant that is then conjoined to the original invariant.
Where multiple sources for the missing wave-front annotations exist,
alternative schemas need to be considered. Using rippling in reverse, alternative
sources for the missing annotations can be identified. Each alternative gives rise
to a unique candidate invariant schema. Again, information gathered during
bottom-up analysis can be used to impose an ordering on the schemas, as will
be illustrated later.
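
The patch step can be illustrated with a small Python sketch that removes the term structure corresponding to a missing wave-front. Terms are modelled as nested tuples, which is an assumption made purely for illustration; the function name `eliminate_wave_front` and its parameters are hypothetical and do not come from the proof planner discussed here.

```python
from typing import Tuple, Union

# A term is a constant/variable name, or a tuple (functor, arg1, ..., argN).
Term = Union[str, Tuple]


def eliminate_wave_front(term: Term, functor: str, hole: int) -> Term:
    """Replace each subterm headed by `functor` with its argument at index `hole`."""
    if isinstance(term, str):
        return term
    head, *args = term
    new_args = [eliminate_wave_front(arg, functor, hole) for arg in args]
    if head == functor:
        # Drop the wave-front, keep only the content of the wave-hole.
        return new_args[hole]
    return (head, *new_args)


blocked_goal = ("f", ("g", ("c", "a", "b")))
auxiliary_invariant = eliminate_wave_front(blocked_goal, "c", 0)
```

Here the blocked goal f(g(c(a, b))) becomes f(g(a)), matching the general case described above. Choosing a different `hole` index corresponds to a different source for the missing annotation and therefore to an alternative candidate schema.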



5 Verification of a Bubble Sort Program 

We now apply the ideas described above to the verification of bubble sort. The 
SPARK version of bubble sort, which is verified, is given in figure 5.3. Note 
that the code is annotated with preconditions and postconditions, but no loop 
invariants are specified. Moreover, since the code involves nested loops,
two loop invariants will be required in order to prove partial correctness.



5.1 Bottom-Up Analysis 

In terms of proof construction, our bottom-up analysis of the bubble sort code
generates a couple of invariant properties. Firstly, the analysis identifies the
bounds on loop counter I: $1 < i \land i < \mathit{last}$. Secondly, bounds on loop counter
J are also identified: $i < j \land j < \mathit{last}$. Note that the second invariant is
with respect to the inner-loop only. These are generated by analysing the initial
and final values of both loop counters. In terms of proof search, the following
properties are established:



$$\mathit{mono\_inc}(i, \mathit{for\_loop\_i}) \tag{2}$$

$$\mathit{mono\_dec}(j, \mathit{for\_loop\_j}) \tag{3}$$

$$\mathit{constant}(i, \mathit{for\_loop\_j}) \tag{4}$$

Note that these meta predicates are defined in figure 1.



Note: the use of rippling in reverse in §4.2 is analogous to the induction revision
critic (see [11]), where rippling in reverse is used to determine alternative
induction schemas.

5.2 Top-Down Analysis

We now turn to the specification of bubble sort. The predicate Ordered, which
forms part of the postcondition, is defined as follows:

$$\mathit{ordered}(A, L, U) \leftrightarrow (\forall P : \mathit{int}.\ (L < P \land P < U) \rightarrow \mathit{ele}(A, [P]) < \mathit{ele}(A, [P + 1])) \tag{5}$$

Unfolding using (5), the Ordered predicate becomes:

$$(\forall p : \mathit{int}.\ ((0 < p) \land (p < \mathit{last})) \rightarrow \mathit{ele}(\mathit{table}, [p]) < \mathit{ele}(\mathit{table}, [p + 1])) \tag{6}$$

This is a candidate for the range restriction invariant pattern. From our
bottom-up analysis of the bubble sort code (see §5.1), we identify nested loops. The
outer-loop is associated with a single partition defined by (2), while the
inner-loop is associated with partitions defined by (3) and (4). As mentioned above,
we consider the outermost loop first, then the second outermost loop and so on.
So in weakening (6) we consider the partition defined by I. By (2), we know that
I monotonically increases during the execution of the outer-loop, which suggests
replacing $\mathit{last}$ by $F_1(i)$ to give an outer-loop invariant schema of the form:

$$(\forall p : \mathit{int}.\ ((0 < p) \land (p < F_1(i))) \rightarrow \mathit{ele}(\mathit{table}, [p]) < \mathit{ele}(\mathit{table}, [p + 1])) \tag{7}$$

This invariant schema is inherited by the inner-loop. By (4) we know that I
remains constant within the inner-loop. As a consequence we only consider a
partition defined by (3) as the basis for a further weakening of (7). By (3), we
know that J monotonically decreases during the execution of the inner-loop,
which suggests replacing 0 by $G_1(j)$ within (7) to give:

$$(\forall p : \mathit{int}.\ ((G_1(j) < p) \land (p < F_1(i))) \rightarrow \mathit{ele}(\mathit{table}, [p]) < \mathit{ele}(\mathit{table}, [p + 1])) \tag{8}$$

Clearly the more meta-variables that appear within a schema, the greater the
search control problems. To minimize these problems we organize the search by
ordering schemas according to the number of meta-variables they contain. For
instance, schema (7) has fewer meta-variables than schema (8), so proof planning
with respect to (8) will only be undertaken if the proof planning for (7) is not
successful.
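
The ordering heuristic can be sketched in a few lines of Python. The `Schema` record and the `planning_order` function are hypothetical names introduced only for this illustration; the sketch assumes a schema is characterised by its label and the set of meta-variables it contains.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class Schema:
    label: str
    meta_variables: FrozenSet[str]


def planning_order(schemas: List[Schema]) -> List[Schema]:
    """Try schemas with fewer meta-variables first (stable for ties)."""
    return sorted(schemas, key=lambda schema: len(schema.meta_variables))


schema_7 = Schema("(7)", frozenset({"F1"}))
schema_8 = Schema("(8)", frozenset({"F1", "G1"}))
order = [schema.label for schema in planning_order([schema_8, schema_7])]
```

With these two schemas the resulting order is (7) followed by (8), so planning with respect to (8) is only reached if (7) fails.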

5.3 Proof Planning and Proof-Failure Analysis 

The proof planning requires a number of attempts, where each attempt refines 
the candidate invariants. Success corresponds to the generation of a concrete set 
of invariants (proof annotations) and proof tactics for the associated verification 
conditions (VCs). 



First Proof Planning Attempt: The analysis outlined above gives rise to a set 
of schematic VCs. We focus on the Ordered predicate. Following a rippling style 
of proof, we have schematic hypothesis (7) and an annotated goal of the form: 



$$(\forall p : \mathit{int}.\ ((0 < p) \land (p < F_1(\boxed{\underline{i} + 1}))) \rightarrow \mathit{ele}(\mathit{table}, [p]) < \mathit{ele}(\mathit{table}, [p + 1])) \tag{13}$$

Using wave-rules (9) and (10) rippling rewrites (13) to give:

$$(\forall p : \mathit{int}.\ ((0 < p) \land (p < i - F_2(i + 1))) \rightarrow \mathit{ele}(\mathit{table}, [p]) < \mathit{ele}(\mathit{table}, [p + 1]))\ \land$$

$$((0 < (i - F_2(i + 1))) \rightarrow \mathit{ele}(\mathit{table}, [i - F_2(i + 1)]) < \mathit{ele}(\mathit{table}, [i - F_2(i + 1) + 1]))$$

Note that as a side-effect, $F_1$ is partially instantiated, i.e. $F_1$ becomes
$\lambda x.\ x - F_2(x)$. Fertilization with hypothesis (7) would leave a residue of the form:

$$(0 < (i - F_2(i + 1))) \rightarrow \mathit{ele}(\mathit{table}, [i - F_2(i + 1)]) < \mathit{ele}(\mathit{table}, [i - F_2(i + 1) + 1])$$

Decomposing the implication gives rise to a new hypothesis

$$0 < (i - F_2(i + 1)) \tag{14}$$

and a goal of the form:

$$\underbrace{\mathit{ele}(\mathit{table}, [i - F_2(i + 1)]) < \mathit{ele}(\mathit{table}, [i - F_2(i + 1) + 1])}_{\text{blocked}} \tag{15}$$



Note that this goal is blocked as no proof methods are applicable, i.e. rippling,
simplification or fertilization. Proof-failure analysis applies the range
generalization critic. The associated proof patch generates the following auxiliary
invariant schema:

$$(\forall p : \mathit{int}.\ ((0 < p) \land (p < i - F_2(i + 1))) \rightarrow (\forall q : \mathit{int}.\ ((i - F_2(i + 1) < q) \land (q < \mathit{last})) \rightarrow \mathit{ele}(\mathit{table}, [p]) < \mathit{ele}(\mathit{table}, [q]))) \tag{16}$$

The proof patching process is completed by conjoining (16) onto the refined
outer-loop invariant schema, from which a revised set of VCs is generated.



Second Proof Planning Attempt: With the refined outer-loop invariant,
the proof context on the second proof attempt contains (16). Proof proceeds
initially as described for the first attempt. However, where the proof previously
was blocked, hypothesis (16) can be specialized (using (14)) in order to prove
(15). To complete the proof of goal (13) we need to complete the instantiation
of the schema. To achieve this we have to exploit constraints imposed by other
parts of the proof. Testing a candidate invariant on loop entry will typically
detect over-generalizations. For instance, on entry to the outer-loop I has the
value 1 and schema (16) becomes:

$$(\forall p : \mathit{int}.\ ((0 < p) \land (p < 1 - F_2(2))) \rightarrow (\forall q : \mathit{int}.\ ((1 - F_2(2) < q) \land (q < \mathit{last})) \rightarrow \mathit{ele}(\mathit{table}, [p]) < \mathit{ele}(\mathit{table}, [q])))$$

This schematic goal is trivial to prove if $F_2$ is instantiated to be $\lambda x.\ 2$. We return
to the mechanization of such a step in §7. Note that the instantiated invariant
schema asserts that the array table is partitioned such that elements below $i - 2$
(inclusive) are less than or equal to the elements above $i - 2$.



Third Proof Planning Attempt: We now consider the proof of the partitioned
invariant discovered above. In particular, we focus on the VC corresponding
to the path from the inner-loop invariant to the outer-loop invariant, i.e.
where i is equal to j and we have a hypothesis of the form

$$(\forall p : \mathit{int}.\ ((0 < p) \land (p < i - 2)) \rightarrow (\forall q : \mathit{int}.\ ((i - 2 < q) \land (q < \mathit{last})) \rightarrow \mathit{ele}(\mathit{table}, [p]) < \mathit{ele}(\mathit{table}, [q]))) \tag{17}$$

and an annotated goal of the form:

$$(\forall p : \mathit{int}.\ ((0 < p) \land (p < \boxed{\underline{i} + 1} - 2)) \rightarrow (\forall q : \mathit{int}.\ ((\boxed{\underline{i} + 1} - 2 < q) \land (q < \mathit{last})) \rightarrow \mathit{ele}(\mathit{table}, [p]) < \mathit{ele}(\mathit{table}, [q]))) \tag{18}$$

Using wave-rules (9) and (10) rippling rewrites (18) to give:

$$(\forall p : \mathit{int}.\ ((0 < p) \land (p < i - 2)) \rightarrow (\forall q : \mathit{int}.\ ((i - 2 < q) \land (q < \mathit{last})) \rightarrow \mathit{ele}(\mathit{table}, [p]) < \mathit{ele}(\mathit{table}, [q])))\ \land$$

$$(\forall q' : \mathit{int}.\ (((i - 2) + 1 < q') \land (q' < \mathit{last})) \rightarrow \mathit{ele}(\mathit{table}, [i - 2 + 1]) < \mathit{ele}(\mathit{table}, [q']))$$

Fertilization with hypothesis (17) leaves a residue which simplifies to give:

$$\underbrace{(\forall q' : \mathit{int}.\ ((i - 1 < q') \land (q' < \mathit{last})) \rightarrow \mathit{ele}(\mathit{table}, [i - 1]) < \mathit{ele}(\mathit{table}, [q']))}_{\text{blocked}} \tag{19}$$

No proof methods are applicable so the goal is blocked. Motivated by a partial
match with wave-rule (11), proof-failure analysis applies the difference
generalization critic. Given that i and j are equal, two alternative invariant schemas
can be generated. The first asserts a notion of minimum element based upon i:

$$(\forall q' : \mathit{int}.\ ((i < q') \land (q' < \mathit{last})) \rightarrow \mathit{ele}(\mathit{table}, [i]) < \mathit{ele}(\mathit{table}, [q']))$$

Note: testing a candidate invariant on loop entry is analogous to testing base cases
within the context of proof by mathematical induction in order to guard against
an over-generalization.

The second makes a similar assertion for j:

$$(\forall q' : \mathit{int}.\ ((j < q') \land (q' < \mathit{last})) \rightarrow \mathit{ele}(\mathit{table}, [j]) < \mathit{ele}(\mathit{table}, [q'])) \tag{20}$$

Note that from our bottom-up analysis we know that i denotes the upper bound- 
ary of a partition (see (2)) while j denotes the lower boundary of a partition (see 
(3)). Consequently, (20) is most closely aligned with the bubble sort algorithm. 
By this combination of proof-failure and bottom-up analysis, (20) is selected as 
a second auxiliary invariant. Note that (20) asserts that for the partition above 
j, the element indexed by j is the minimum. The proof patching process is
completed by conjoining (20) onto the inner-loop invariant schema, from which a
revised set of VCs is generated.

Fourth and Fifth Proof Planning Attempts: With the refined inner-loop 
invariant, the proof context on the fourth proof attempt contains (20). Proof 
proceeds initially as described for the third attempt. However, where the proof 
previously was blocked (see (19)), hypothesis (20) provides the basis for a simple 
rippling proof. To complete the reasoning, (20) must be shown to be invariant
with respect to the inner-loop; again, a relatively simple application of rippling
is required.

5.4 Summary of Invariant Discovery Results 

Bottom-up analysis generated counter variable properties which contributed to 
both the outer-loop and inner-loop invariants. Top-down analysis, constrained 
by bottom-up analysis, generated an invariant schema. Through proof-failure 
analysis, this schema was refined to give the partitioned invariant, i.e. (17). 
Proof-failure analysis also generated the minimum invariant, i.e. (20). 
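
As an informal cross-check of the discovered invariants, the following Python sketch runs a bubble sort with the ordered, partitioned and minimum properties asserted at run time. It is a 0-based Python variant written for illustration, not the SPARK program that is verified above; the helper names are hypothetical, and reading the comparisons as non-strict is an assumption. Run-time assertions only test the invariants on particular inputs and are no substitute for the proofs discussed in this section.

```python
from typing import List, Sequence


def ordered_prefix(a: Sequence[int], k: int) -> bool:
    """The first k elements are in non-decreasing order."""
    return all(a[p] <= a[p + 1] for p in range(k - 1))


def partitioned(a: Sequence[int], k: int) -> bool:
    """Every element before index k is <= every element from index k onwards."""
    return all(a[p] <= a[q] for p in range(k) for q in range(k, len(a)))


def minimum_at(a: Sequence[int], j: int) -> bool:
    """a[j] is a minimum of the slice a[j:]."""
    return all(a[j] <= a[q] for q in range(j, len(a)))


def bubble_sort_checked(table: Sequence[int]) -> List[int]:
    a = list(table)
    n = len(a)
    for i in range(n - 1):
        # Outer-loop invariant: ordered prefix plus the partitioned property.
        assert ordered_prefix(a, i) and partitioned(a, i)
        for j in range(n - 1, i, -1):
            # Inner-loop invariant: inherited properties plus the minimum property.
            assert ordered_prefix(a, i) and partitioned(a, i) and minimum_at(a, j)
            if a[j - 1] > a[j]:
                a[j - 1], a[j] = a[j], a[j - 1]
        # On leaving the inner loop the minimum has bubbled down to index i.
        assert minimum_at(a, i)
    assert ordered_prefix(a, n)
    return a


result = bubble_sort_checked([5, 2, 9, 1, 5, 6])
```

The partitioned property is what allows the ordered prefix to grow by one element per outer iteration, and the minimum property is what re-establishes the partition at the end of each inner loop.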



6 Comparison with Related Work 

Research into heuristic rules for both bottom-up and top-down analysis has
a long history [15,16,18,19]. The work of Wegbreit led to the development of a
prototype system called VISTA [7]. The VISTA system, and the later RUNCHECK [6]
system, used the strategy of conjoining failed goals onto a conjecture. The VISTA
system was also able to extract information from failed proofs. Our proof critics
extend these ideas. In particular, there are two key differences between our
approach and previous approaches. Firstly, the use of schematic invariants, which
allows an incremental style of generation (cf. generate-and-test). Secondly, the
use of program knowledge in constraining proof patches during proof planning.

7 Current Implementation and Future Work 

Our bottom-up analysis techniques have been implemented and tested within a 
prototype called AutoGap [8]. The application of rippling presented here is not 
new, and we have a proof planner that supports the incremental instantiation
of schematic conjectures and proof patching [10,11,12,13,14,17]. The implementation
of the proposed generalization proof critics is under way. In terms of
future work, the style of proof planning outlined above requires the ability to 
opportunistically switch between VCs. Moreover, the ability to exploit counter- 
examples in instantiating invariant schemas (see second proof planning attempt 
§5.3) is an area that requires further investigation. 

8 Conclusion 

We propose an integration of invariant discovery techniques. The approach relies 
upon the incremental instantiation of invariant schemas and the use of program 
patterns during the patching of failed proof attempts. Our implementation work 
is ongoing, but we believe that this work will demonstrate the synergies that can 
be achieved through the integration of static analysis techniques. 

Abstract. In order to really understand all aspects of logic-based
program development under different semantics, it would be useful to have a
common solid logical foundation. The stable semantics already has one,
based on intuitionistic logic I and using the notion of completions.
Since S4 expresses I, the stable semantics can be fully represented
in S4. We propose the same approach to define extensions of the WFS
semantics. We distinguish a particular semantics that we call AS-WFS,
which is defined over general propositional theories and can be defined via
completions using S4. Interestingly, AS-WFS seems to satisfy most of the
principles of a well behaved semantics. Our general goal is to propose
S4 and completions to study the formal behavior of different semantics.

Keywords: Stable semantics, WFS, FOUR, Modal logics. 



1 Introduction 

A-Prolog (Stable Logic Programming [9] or Answer Set Programming) is the
realization of much theoretical work on Nonmonotonic Reasoning and AI
applications of Logic Programming (LP) in the last 15 years. This is an important
logic programming paradigm that now has great acceptance in the community.
Efficient software to compute answer sets and a large list of applications to
model real life problems justify this assertion. The two most well known systems
that compute stable models are DLV (http://www.dbai.tuwien.ac.at/proj/dlv/) and
SMODELS (http://saturn.hut.fi/pub/smodels/). A characterization of answer sets by
intuitionistic logic I has recently been provided, as follows: a
literal is entailed by a disjunctive program in the stable model semantics if and
only if it belongs to every I-complete and consistent extension of the program
formed by adding only negated atoms [16]. We adopt the following formal
notation to express this fact: M is an answer set of a disjunctive program P if
and only if $P \cup \neg\widetilde{M} \Vdash_I M$.

R. Monroy et al. (Eds.): MICAI 2004, LNAI 2972, pp. 202-211, 2004. 
© Springer- Verlag Berlin Heidelberg 2004 

This logical approach provides the foundations to define the notion of
nonmonotonic inference of any propositional theory (using the standard
connectives) in terms of a monotonic logic (namely intuitionistic logic I), see
[13,14,15,16]. Notions such as conservative extensions, conservative transformations,
equivalence and strong equivalence (see [10,13]), among others, are now better
understood thanks to this logical approach. These notions are very important
if one wants to push forward a logic-based program development approach.

The well founded semantics (WFS) is also a very well known paradigm, which originated
at the same time as the stable semantics [18]. The main difference between
STABLE and WFS is that, in the definition of the former, a guess is made and then
a particular (2-valued) model is constructed and used to justify the guess or to
reject it. In the definition of WFS, however, more and more atoms are declared
to be true (or false): once a decision has been drawn, it will never be rejected.
WFS is based on a single 3-valued intended model.
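
The guess-and-check reading of the stable semantics can be sketched in Python. Note that this sketch uses the classical Gelfond-Lifschitz reduct for normal programs rather than the completion-based characterization studied in this paper, and it enumerates all guesses by brute force, so it is only suitable for tiny programs. The representation of rules as triples and the names `rule`, `least_model` and `stable_models` are assumptions made for illustration.

```python
from itertools import chain, combinations
from typing import FrozenSet, Iterable, List, Set, Tuple

# A rule is (head, positive body atoms, negated body atoms).
Rule = Tuple[str, FrozenSet[str], FrozenSet[str]]


def rule(head: str, positive: Iterable[str] = (), negated: Iterable[str] = ()) -> Rule:
    return (head, frozenset(positive), frozenset(negated))


def least_model(definite_rules: List[Tuple[str, FrozenSet[str]]]) -> Set[str]:
    """Least model of a negation-free program, computed by forward chaining."""
    model: Set[str] = set()
    changed = True
    while changed:
        changed = False
        for head, positive in definite_rules:
            if positive <= model and head not in model:
                model.add(head)
                changed = True
    return model


def stable_models(program: List[Rule], atoms: Iterable[str]) -> List[FrozenSet[str]]:
    """Guess a set of atoms, build the reduct, and keep the guess if it is justified."""
    atom_list = sorted(atoms)
    guesses = chain.from_iterable(
        combinations(atom_list, size) for size in range(len(atom_list) + 1)
    )
    found = []
    for guess in map(frozenset, guesses):
        # Drop rules whose negated body is contradicted by the guess,
        # and drop the negated literals from the remaining rules.
        reduct = [(head, pos) for head, pos, neg in program if not (neg & guess)]
        if least_model(reduct) == guess:
            found.append(guess)
    return found


choice = [rule("a", negated=["b"]), rule("b", negated=["a"])]
odd_loop = [rule("p", negated=["p"])]
choice_models = stable_models(choice, {"a", "b"})
odd_loop_models = stable_models(odd_loop, {"p"})
```

The first program has the two stable models {a} and {b}, while the second has none, which illustrates how a guess can be rejected after the corresponding 2-valued model has been constructed.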

Several authors have recognized the interest in semantics whose behavior is close
to classical logic, see [4,5,6,7,17]. They have extended the WFS semantics
by putting an additional mechanism on top of its definition. Dix noticed that
the new semantics sometimes have more serious shortcomings than WFS and
hence he defined a set of principles against which all semantics should be checked
[6]. It is worth mentioning that such notions helped Dix to propose the concept
of well behaved semantics. We think that it is important to understand this
concept well if one wants to follow any serious methodology for logic-based program
development. We introduce an extension of WFS that we will call AS-WFS, with
the following properties:

1. It is defined based on completions (as the stable semantics) but using the
well known S4 logic. In our notation, M is an AS-WFS set of P if and
only if $P \cup \neg\widetilde{M} \Vdash_{S4} M$. This is our scenarios semantics. We can define our
sceptical AS-WFS semantics as usual. The STABLE semantics is defined as
completions using I, namely $P \cup \neg\widetilde{M} \Vdash_I M$. Since S4 can express I (using
the Gödel translation [1]), S4 also defines stable [14].

2. AS-WFS is defined for propositional theories based on basic formulas.

3. Using the knowledge ordering ($\leq_k$, see [5]), we have that WFS $\leq$ AS-WFS
$\leq$ STABLE for normal programs. The well known WFS+ defined by Dix
also satisfies this property.

4. AS-WFS can also be defined using modal logic S5. Moreover, AS-WFS can
also be defined using the well known bilattice FOUR. The stable semantics
can also be defined using FOUR [3].

5. The known counterexamples to the well-behavedness of several known
extensions of WFS (such as GWFS and EWFS) do not apply to AS-WFS. It
seems, but we do not know yet, that AS-WFS satisfies several of the
principles given for well behaved semantics [2].

6. AS-WFS is different from GWFS, EWFS, WFS+.

We expect the reader to have some familiarity with modal logics, many valued
logics and logic programming.

2 Background 

We consider a formal (propositional) language built from an alphabet containing:
a denumerable set $\mathcal{L}$ of elements called atoms, the standard 2-place connectives
$\land$, $\lor$, $\rightarrow$, and the 1-place connective $\neg$. Formulas and theories are constructed as
usual in logic. In this paper we only consider finite theories. We will later define
other connectives, but only for temporary use.

We define the class of basic formulas recursively as follows:

$a$ and $\neg a$, if $a$ is an atom.
$\alpha \lor \beta$, if $\alpha$, $\beta$ are basic formulas.
$\alpha \land \beta$, if $\alpha$, $\beta$ are basic formulas.
$\alpha \rightarrow \beta$, if $\alpha$, $\beta$ are basic formulas.

A normal program is a set of rules of the form

$$A_1 \land \dots \land A_m \land \neg A_{m+1} \land \dots \land \neg A_n \rightarrow A_0$$

We use the well known definition of a stratified program, see [12]. We use
the notation $\vdash_X F$ to denote that the formula $F$ is provable (a theorem or
tautology) in logic $X$. If $T$ is a theory we use the symbol $T \vdash_X F$ to denote
$\vdash_X (F_1 \land \dots \land F_n) \rightarrow F$ for some formulas $F_i \in T$. We say that a theory $T$ is
consistent if it has a model in the given logic. We also introduce, if $T$ and $U$ are
two theories, the symbol $T \vdash_X U$ to denote that $T \vdash_X F$ for all formulas $F \in U$.
We will write $T \Vdash_X U$ to denote the fact that (i) $T$ is consistent and (ii) $T \vdash_X U$.

Given a class of programs $C$, a semantic operator Sem is a function that
assigns to each program $P \in C$ a set of sets of atoms $M \subseteq \mathcal{L}_P$. These sets of
atoms are usually some "preferred" two-valued models of the program $P$; each of
them is called a Sem model of $P$. Sometimes we say that this is the scenarios
semantics [7]. Given a scenarios semantics Sem, we define the sceptical semantics
of a program $P$ as $\mathrm{Sem}(P) = \bigcap \{M \cup \neg\widetilde{M} : M \text{ is a Sem model of } P\}$, where
$\widetilde{M} = \mathcal{L}_P \setminus M$ and $\neg\widetilde{M} = \{\neg a : a \in \widetilde{M}\}$.

Given two semantics $S_1$ and $S_2$, we define: $S_1 \leq S_2$ if for every program $P$
it is true that $S_1(P) \subseteq S_2(P)$. We can easily define $S_1 = S_2$ and $S_1 < S_2$. We say
that a semantics is stronger than another one according to this order.
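
The passage from a scenarios semantics to its sceptical version can be sketched as follows. The function names `sceptical` and `weaker_or_equal`, and the encoding of a negative literal as the string `"not a"`, are assumptions made for this illustration; the scenario models in the example are simply given as input rather than computed by any particular semantics.

```python
from typing import FrozenSet, Iterable, Set


def sceptical(models: Iterable[FrozenSet[str]], language: FrozenSet[str]) -> Set[str]:
    """Intersect M together with the negated complement of M over all scenario models M."""
    literal_sets = [
        set(model) | {"not " + atom for atom in language - model}
        for model in models
    ]
    if not literal_sets:
        raise ValueError("no scenario models were given")
    return set.intersection(*literal_sets)


def weaker_or_equal(s1: Set[str], s2: Set[str]) -> bool:
    """S1(P) is included in S2(P) for one fixed program P."""
    return s1 <= s2


language = frozenset({"a", "b", "x"})
scenarios = [frozenset({"a", "x"}), frozenset({"b", "x"})]
consequences = sceptical(scenarios, language)
```

With the two scenario models {a, x} and {b, x}, only x survives the intersection, since a and b are each negated in one of the two literal sets. The ordering between two semantics then amounts to checking `weaker_or_equal` for every program in the class.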

2.1 Non Monotonic Reasoning via S5 

Consider modal logic S5 with its standard connectives that we will denote as:
$\sim$, $\rightarrow$, $\lor$, $\land$. McDermott and Doyle introduced a non-monotonic version of S5. They
define the X-expansions $P$ of a theory $T$ as those sets satisfying the equation:

$$P = Cn_X(T \cup \{\sim\Box\varphi : \varphi \notin P\}) \tag{1}$$
where $Cn_X$ is the inference operation of the modal logic $X$. Depending on the
approach, an arbitrarily selected X-expansion for $T$ or the intersection of all
X-expansions for $T$ is considered as a set of nonmonotonic consequences of $T$.
McDermott proved that S5 coincides with its non-monotonic version, hence it is
not very interesting. However, he considered formulas to complete the theory. If
we only consider adding simple formulas of the form $\sim\Box a$ ($a$ an atom), the story
changes. In fact, Gelfond [8] was able to characterize stable models of stratified
normal programs using this idea and the following translation. A normal clause

$$A_1 \land \dots \land A_m \land \neg A_{m+1} \land \dots \land \neg A_n \rightarrow A_0$$

becomes

$$A_1 \land \dots \land A_m \land \sim\Box A_{m+1} \land \dots \land \sim\Box A_n \rightarrow A_0$$
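
The translation is purely syntactic, as the following Python sketch shows. The function name `gelfond_translation`, the representation of a clause by its head and two lists of body atoms, and the textual rendering with `~□` are all assumptions made for illustration.

```python
from typing import Sequence


def gelfond_translation(head: str, positive: Sequence[str], negated: Sequence[str]) -> str:
    """Render a normal clause with each negated body atom A replaced by ~□A."""
    literals = list(positive) + ["~□" + atom for atom in negated]
    if not literals:
        return head
    return " ∧ ".join(literals) + " → " + head


clause = gelfond_translation("a0", ["a1", "a2"], ["a3"])
fact = gelfond_translation("a0", [], [])
```

Only the negated body atoms are affected, so the scope of the modal operator in a translated clause never contains anything other than an atom.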



3 Definition via S4 

Consider modal logic S4 with its standard connectives that we will denote as:
$\rightarrow$, $\lor$, $\land$.

Let $\neg\alpha$ be the abbreviation of the modal formula $\sim\Box\alpha$. Gelfond [8]
gives a definition of a semantics similar to AS-WFS, but it covers only the class
of stratified programs. We generalize this concept in the following definition:

Definition 1. Let $P$ be a theory based on basic formulas and $M$ be a set of atoms.
We define $M$ to be an AS-WFS model of $P$ iff $P \cup \neg\widetilde{M} \Vdash_{S4} M$. We denote the
sceptical semantics of $P$ as AS-WFS($P$).

Hence, this definition opens the research line of defining other WFS
extensions via different modal logics. For example, modal logic K behaves 'closer'
(but still different) to the stable semantics. Consider the following example:
$\neg a \rightarrow b$, $\neg b \rightarrow a$, $\neg p \rightarrow a$, $\neg p \rightarrow p$. Then AS-WFS has two models, namely
$\{a, p\}$ and $\{b, p\}$. But using modal logic K we obtain no models.

If we consider S5 instead, we obtain the same semantics.

Lemma 1. Let $P$ be a theory based on basic formulas and $M$ be a set of atoms.
Then $M$ is an AS-WFS model of $P$ iff $P \cup \neg\widetilde{M} \Vdash_{S5} M$.

(See proof in Appendix).

It is well known that STABLE and WFS agree on the class of normal stratified
programs. Gelfond showed [8] that the stable models of stratified programs can
be characterized using S5 completions under his proposed translation. Hence,
we have the following result.

Corollary 1. If $P$ is a stratified normal program, WFS($P$) = AS-WFS($P$) =
STABLE($P$).

4 Results

We first show that AS-WFS is different from some well known semantics and then
present its characterization using S4. We will also introduce some families of
semantics and finish this section by presenting a brief comment on well behaved
semantics.



4.1 Comparing AS-WFS with Other Semantics 

Consider the EWFS semantics, the CUT rule and the following example, all of
them taken from [5]:

$$\neg a \rightarrow a, \quad \neg x \land a \rightarrow b, \quad \neg b \rightarrow y, \quad \neg y \rightarrow z$$

Here EWFS($P$) = $\{a, b, \neg x\}$; however, EWFS($P \cup \{b\}$) = $\{a, b, z, \neg x, \neg y\}$.
This example shows that EWFS does not satisfy CUT. AS-WFS($P$) =
$\{a, b, z, \neg x, \neg y\}$, as well as AS-WFS($P \cup \{b\}$) = $\{a, b, z, \neg x, \neg y\}$. Hence AS-WFS
is different from EWFS.

Consider the following two program examples taken from [5]:

$$\neg b \rightarrow p, \quad c \rightarrow b, \quad (p \land \neg a) \rightarrow c, \quad \neg b \rightarrow a$$

and

$$\neg b \rightarrow p, \quad (p \land \neg a) \rightarrow b, \quad \neg b \rightarrow a$$

One may expect the same semantics of both programs w.r.t. the common
language. However, GWFS infers $p$ in the first program, but it does not in the
second program. AS-WFS gives the same answer in both programs, which consists
in deriving only $\neg c$ in both programs. Hence, AS-WFS is different from GWFS.
Consider the following program example taken from [7]:

$$\neg b \rightarrow a, \quad \neg a \rightarrow b, \quad \neg a \rightarrow x, \quad \neg b \rightarrow x$$

Note that WFS+($P$) = $\{\}$, but AS-WFS($P$) = $\{x\}$.

In [7] we have that WFS+ is a stronger extension of WFS. Comparing
the semantics we propose with the results obtained in [5], we have that WFS $\leq$
AS-WFS $\leq$ STABLE. This can be formalized as follows:

Lemma 2. Let $P$ be a normal program, then WFS($P$) $\leq$ AS-WFS($P$) $\leq$
STABLE($P$).

4.2 Well Behaved Semantics 

We conjecture that AS-WFS satisfies all principles involved in the definition of
a well behaved semantics as long as we refuse to interpret $P \cup M$ as $P^M$. Note
that the notion $P^M$ is a syntactic transformation, not required when $P \cup M$
has a logical meaning. Take for instance the following program $P$ from [6]:
$b \rightarrow a$, $\neg a \rightarrow b$. This example is used to show that WFS+ does not satisfy
the Extended Cut principle. While WFS+($P$) = $\{a, \neg b\}$, WFS+($P \cup \{\neg b\}$) =
$\{\neg a, \neg b\}$. Moreover, $\{\neg a, \neg b\}$ is neither a 2-valued model nor a 3-valued model
of the program $P \cup \{\neg b\}$. However, this happens because WFS+ does not have
a "logical" definition for the semantics of programs extended with constraints
(negated formulas). Hence, Dix interprets $P \cup \{\neg b\}$ as $P^{\{\neg b\}} = \{b \rightarrow a\}$.
Under this reading, $\{\neg a, \neg b\}$ is a model of $P \cup \{\neg b\}$. AS-WFS has a definition for the
semantics of any basic propositional theory, which allows the use of constraints. In this
example we get AS-WFS($P$) = AS-WFS($P \cup \{\neg b\}$) = $\{a, \neg b\}$. Hence, we propose to
reconsider Dix's work on well-behaved semantics, in the direction of making
it more general and logic based.

5 Characterization of AS-WFS via FOUR

The logical role that the four-valued structure has among Ginsberg's well known
bilattices is similar to the role that the two-valued algebra has among Boolean
algebras. Four-valued semantics is a very suitable setting for computerized
reasoning according to Belnap, and in fact the original motivation of Ginsberg for
introducing bilattices was to provide a uniform approach for a diversity of
applications in AI. Bilattices were further investigated by Fitting, who showed that
they are also useful for providing semantics to logic programs; hence our interest
is focused on relating FOUR with AS-WFS.



5.1 The FOUR-Valuation Bilattice

Belnap introduced a logic for dealing in a useful way with inconsistent and
incomplete information. This logic is based on a structure called FOUR, see
[3]. This structure has four truth values: the classical $t$ and $f$, and two new ones,
$\bot$, which intuitively denotes lack of information (no knowledge), and $\top$, which
indicates inconsistency ("over"-knowledge). These values have two different natural
orderings.

- Measuring the truth:

The minimal element is $f$, the maximal element is $t$, and the values $\top$ and $\bot$ are
incomparable. Here we have the inverse involution, and the meet and join
operators denoted respectively as $\land_{tr}$ and $\lor_{tr}$.

- Reflecting differences in the amount of knowledge or information:

The minimal element is $\bot$, the maximal element is $\top$, and the values $f$ and $t$
are incomparable. Here we have the inverse involution, and the meet and join
operators denoted respectively as $\land_{kn}$ and $\lor_{kn}$.
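
The two orderings can be made concrete with a short Python sketch. It uses a common encoding of the four values as pairs of bits (support for truth, support for falsity), which is an assumption of this sketch and not the presentation used in this paper; all function and constant names are hypothetical. Under this encoding the number `2 * support_true + support_false` reproduces the numeric reading of the values as 0, 1, 2 and 3 that is adopted in the next subsection.

```python
from typing import Tuple

# A value is (support for truth, support for falsity), each 0 or 1.
Value = Tuple[int, int]

BOTTOM: Value = (0, 0)
FALSE: Value = (0, 1)
TRUE: Value = (1, 0)
TOP: Value = (1, 1)


def code(v: Value) -> int:
    """Numeric reading: BOTTOM is 0, FALSE is 1, TRUE is 2, TOP is 3."""
    return 2 * v[0] + v[1]


def leq_truth(x: Value, y: Value) -> bool:
    return x[0] <= y[0] and x[1] >= y[1]


def leq_knowledge(x: Value, y: Value) -> bool:
    return x[0] <= y[0] and x[1] <= y[1]


def meet_truth(x: Value, y: Value) -> Value:
    return (min(x[0], y[0]), max(x[1], y[1]))


def join_truth(x: Value, y: Value) -> Value:
    return (max(x[0], y[0]), min(x[1], y[1]))


def meet_knowledge(x: Value, y: Value) -> Value:
    return (min(x[0], y[0]), min(x[1], y[1]))


def join_knowledge(x: Value, y: Value) -> Value:
    return (max(x[0], y[0]), max(x[1], y[1]))


codes = [code(v) for v in (BOTTOM, FALSE, TRUE, TOP)]
truth_meet_of_extremes = meet_truth(TOP, BOTTOM)
knowledge_join_of_classical = join_knowledge(TRUE, FALSE)
```

The sketch makes the incomparabilities visible: the two non-classical values are unrelated in the truth ordering, and the two classical values are unrelated in the knowledge ordering, yet each pair still has a meet and a join.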



5.2 The AS-WFS Semantics 

We first explain how we are going to use FOUR to define our semantics.

We read the bilattice FOUR identifying $\bot$ as 0, $\top$ as 3, $f$ as 1 and $t$ as 2. We
give the following tables for the connectives $\sim$ and $\rightarrow$, and a further
table as a valuation for the assertion connective $\Box$. It will become clearer
later why we are selecting a "typical" modal operator symbol for this purpose.

The idea of this last connective is due to the Russian logician Bochvar. It
is intended to represent the "external assertion" of a proposition $p$; that is, $\Box p$ can
be considered as the assertion "$p$ is true" in a two-valued metalanguage. We
define the operators as follows:



[Tables: valuations of the connectives $\sim$, $\rightarrow$ and $\Box$ over the values 0, 1, 2 and 3]

It is important to note that these connectives are abbreviations using
the standard language of FOUR, as shown in the following table.

As before, we define our main negation operator ($\neg$): $\neg p$ as $\sim\Box p$.

Using the reading of FOUR and the valuation defined in the tables, tautologies
will be the formulas whose truth value is 3. Examples of some tautologies are:
$(\neg a \rightarrow a) \rightarrow a$, $a \lor \neg a$, $\neg\neg a \rightarrow a$, $a \rightarrow a$. Note that $a \rightarrow \neg\neg a$ is not a tautology.



Theorem 1. Let $P$ be a theory based on basic formulas and $M$ be a set of atoms.
$M$ is an AS-WFS model of $P$ iff $P \cup \neg\widetilde{M} \Vdash_{FOUR} M$.

(See proof in Appendix)



6 Conclusions 

There is still current interest in extensions of WFS ([4,7]). However, we need to
find a logical framework in which to define such extensions if one really believes in a
logic-based program development approach. We propose an approach to define
extensions of the WFS semantics based on completions, in the same spirit as
STABLE, hence closing the gap between both approaches. As a result, we gain
a better understanding of those semantics as well as the relation among them.
We have that AS-WFS is sound with respect to the stable model semantics and
it can be used to approximate stable entailment. There is still work to do with
respect to this semantics, and our future work is to study it in more depth
to see which properties of well-behaved semantics it satisfies.

Appendix

We define a K-basic formula as the modal formula resulting from the translation
of Gelfond [8] of a basic formula. Then, if $\alpha$ is a K-basic formula, the scope of the
modal operator $\Box$ will be only atoms.

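Proof Lemma 1.

$P \cup \neg\widetilde{M} \Vdash_{S4} M$, i.e. $\vdash_{S4} \bigwedge(P \cup \neg\widetilde{M}) \rightarrow \bigwedge M$. As $P$ is a theory based on
basic formulas and $M$ is a set of atoms, $\bigwedge(P \cup \neg\widetilde{M}) \rightarrow \bigwedge M$ is a K-basic
formula.

If $\alpha$ is a K-basic formula then $\vdash_{S4} \alpha$ iff $\vdash_{S5} \alpha$. To prove it, sufficiency
follows immediately, so it is enough to check necessity. Suppose $\nvdash_{S4} \alpha$; then
there exists a model $\mathcal{M} = (R, S, V)$ of S4 such that $\mathcal{M} \not\models \alpha$. Let $\mathcal{M}' = (R', S, V)$
where $R' = R \cup \{(x, x) \mid x \in S\}$; then $\mathcal{M}'$ is a model of S5. By a direct induction
over the number of connectives of $\alpha$ we have that $\mathcal{M}' \not\models \alpha$, i.e. $\nvdash_{S5} \alpha$. Then
$\vdash_{S5} \alpha$ implies $\vdash_{S4} \alpha$.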

Logic $S5_2$ is constructed by adding to logic S5 the following axiom:

F2 : O4I4 A OA2 A A A2) — )■ V A2) 



Proof of Theorem 1.

Applying Lemma 1 we have $P \cup \neg M \Vdash_{S5} M$, equivalent to $\vdash_{S5} \bigwedge(P \cup \neg M) \to
\bigwedge M$, where $\bigwedge(P \cup \neg M) \to \bigwedge M$ is a K-basic formula. We have the following two
lemmas:

Lemma 3. If $\alpha$ is a K-basic formula then $\vdash_{S5} \alpha$ iff $\vdash_{S5_2} \alpha$.

(See proof in Appendix).

Lemma 4. Let $\alpha$ be a modal formula. $\vdash_{S5_2} \alpha$ iff $\models_{FOUR} \alpha$.

(See proof in Appendix).

Then by Lemmas 3 and 4, $\models_{FOUR} \bigwedge(P \cup \neg M) \to \bigwedge M$ and $P \cup \neg M \Vdash_{FOUR} M$.

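Proof of Lemma 3.

We first define the model $\mathcal{M}^t$ based on a model $\mathcal{M} = (R,S,V)$ and a fixed
element $t$ of $S$. We suppose that $S$ has at least two elements.

We define:

$S' = S/\{t\} = \{\{t\}, T\}$ where $T = \{s \in S \mid s \neq t\}$

$R' = S' \times S'$

$$V'(p) = \begin{cases}
S' & \text{if } \forall r \in S,\ V_r(p) = \text{true} \\
\{\{t\}\} & \text{if } V(p) = \{t\} \text{ and } \exists r \in T,\ V_r(p) = \text{false} \\
\{T\} & \text{if } t \notin V(p) \text{ and } \exists t' \in T,\ t' \in V(p) \\
\emptyset & \text{otherwise}
\end{cases}$$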
Then $\mathcal{M}' = (R', S', V')$ is a model of $S5_2$ by construction. We have the following
corollary about the relation between this new model and models of S5.

Corollary 2. Let $\mathcal{M}$ be a model of S5 with at least two elements, $s$ a fixed
element of $S$, and $\alpha$ a K-basic formula; then $\mathcal{M} \models_s \alpha$ iff $\mathcal{M}^s \models_{\{s\}} \alpha$.

The proof is a direct induction on the size of $\alpha$.

Sufficiency of our Lemma 3 follows immediately. Let us check necessity:
Suppose $\nvdash_{S5} \alpha$ and $\vdash_{S5_2} \alpha$. Then there exists a model $\mathcal{M}$ of S5 (with at least three
elements, otherwise the proof is trivial) such that $\mathcal{M} \not\models \alpha$; then there exists $s \in S$
such that $\mathcal{M} \not\models_s \alpha$; then by Corollary 2, $\mathcal{M}^s \not\models_{\{s\}} \alpha$. But $\vdash_{S5_2} \alpha$ and $\mathcal{M}^s$ is
a model of $S5_2$, so $\mathcal{M}^s \models_{\{s\}} \alpha$, which leads to a contradiction.

Proof of Lemma 4.

Let $\mathcal{F} = \langle \{1,2\}, \{1,2\} \times \{1,2\} \rangle$ and $\mathcal{M} = \langle \{1,2\} \times \{1,2\}, \{1,2\}, V \rangle$.
Suppose that $\gamma \in Form(\phi)$, let $\theta = \{a_1, a_2, \ldots, a_n\} \subseteq \phi$ be atoms of the language and
$V$ any valuation on $\theta$ of the atomic formulas; we define $g : \{0, 1, 2, 3\} \to
\{\emptyset, \{1\}, \{2\}, \{1,2\}\}$ as

$g(0) = \emptyset$, $g(1) = \{1\}$, $g(2) = \{2\}$, $g(3) = \{1,2\}$.

Then we have the tetra-valuation $V_{FOUR}$ defined as follows:

1. If $a$ is an atom, $V_{FOUR}(a) = g^{-1}(V(a))$.

2. If $\alpha$ is not atomic, $V_{FOUR}$ is defined recursively over the valuation of the
operators □ given in section 5.2.
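
The correspondence between the four truth values and the subsets of the two-element set {1, 2} can be made concrete. The following is a minimal sketch, assuming only the map g listed above and assuming that V assigns to each atom the set of worlds in which it holds; the names `G`, `G_INV`, `four_value` and `is_designated` are illustrative and do not come from the paper. The recursive clause for non-atomic formulas is omitted because its tables are given in section 5.2.

```python
# Map g from the four truth values to subsets of the two worlds {1, 2}:
# g(0) = {}, g(1) = {1}, g(2) = {2}, g(3) = {1, 2}.
G = {
    0: frozenset(),
    1: frozenset({1}),
    2: frozenset({2}),
    3: frozenset({1, 2}),
}
# g is a bijection, so its inverse is well defined.
G_INV = {worlds: value for value, worlds in G.items()}


def four_value(atom, valuation):
    """Return V_FOUR(atom) = g^{-1}(V(atom)) for an atomic formula."""
    return G_INV[frozenset(valuation[atom])]


def is_designated(value):
    """Tautologies are the formulas whose truth value is 3."""
    return value == 3


# Hypothetical valuation: each atom maps to the worlds where it holds.
V = {"a": {1, 2}, "b": {2}, "c": set()}
values = [four_value(atom, V) for atom in ("a", "b", "c")]
print(values)  # [3, 2, 0]
```

An atom receives the designated value 3 exactly when it holds in both worlds, which is the atomic case of the correspondence between validity in the two-element frame and the value 3 in FOUR.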

From the above definitions we obtain the following corollary about the relation
between the valuation and the frames.



Corollary 3. Let be VpouR the valuation FOUR extended to the modal 
operators as we have defined then Vfour{o:) = * iff ^ 



In other words, if we denote $V_{FOUR}(\alpha) = 3$ as $\models_{FOUR} \alpha$, we have that
$\mathcal{F} \models \alpha$ iff $\models_{FOUR} \alpha$.
As $S5_2$ is determined by the class of frames based on a set with two elements
and a reflexive, transitive, symmetric, serial and euclidean relation, we have
$\vdash_{S5_2} \alpha$ iff $\mathcal{F} \models \alpha$. Hence, if $V_{FOUR}$ is the valuation FOUR extended to the modal
operators as we have defined, $\vdash_{S5_2} \alpha$ iff $\models_{FOUR} \alpha$.



^3 The proof of this corollary is available by request via e-mail.
Abstract. Many applications have shown that the combination of specialized
reasoning systems, such as deduction and computation systems, can lead to syn-
ergetic effects. Often, a clever combination of different reasoning systems can
solve problems that are beyond the problem solving horizon of single, stand-alone
systems. Current platforms for the integration of reasoning systems typically lack
abstraction, robustness, and automatic coordination of reasoners. We are currently
developing a new framework for reasoning agents to solve these problems. Our
framework builds on the FIPA specifications for multi-agent systems, formal ser-
vice descriptions, and a central brokering mechanism. In this paper we present the
architecture of our framework and our progress with the integration of automated
theorem provers.



1 Introduction 

Automated reasoning systems^1 have reached a high degree of maturity in the last decade.
However, reasoning systems are highly specialized and, typically, they can only solve 
problems in a particular domain such as, for instance, proof by induction, reasoning on 
first-order logic with equality, or computation in group theory. Many case studies have 
shown that the combination of reasoning specialists can help to solve problems that are 
beyond the problem solving horizon of single, stand-alone systems (see, e.g., [1,2]). 

In our research group we have developed the MathWeb Software Bus [3] for the 
integration of heterogeneous reasoning systems. The main idea behind the MathWeb- 
SB is to extend existing reasoning systems, such as automated theorem provers (ATPs), 
computer algebra systems (CASs), model generators, and constraint solvers, with a 
generic interface that allows them to communicate over a common software bus. The 
MathWeb-SB has proven very successful for the integration of reasoning specialists 
on the system level [4,5], and is in everyday use in different research groups. Despite 
this success, problems occurred in certain applications that are hard to solve without 
fundamental changes in the architecture of the MathWeb-SB. Among others we faced 
the following problems: 

* The author is supported by the CALCULEMUS IHP grant HPRN-CT-2000-00102. 

^1 Throughout this paper, the term reasoning system denotes deduction systems as well as symbolic
computation systems.
1. Access to reasoning systems has to be performed on the system level, i.e. devel-
opers of client applications for the MathWeb-SB still have to know which reasoning
system is suitable for the problem at hand, and how to access the system.

2. Many problems cannot be solved by one single reasoning system but only by a co- 
ordinated interplay between different systems. A coordination of reasoning systems 
by the MathWeb-SB is not possible. Thus, client applications have to coordinate the 
systems needed to solve a problem. 

3. The client-broker-server architecture of the MathWeb-SB is not designed for asyn- 
chronous communication. Synchronous communication is not flexible enough for 
modern applications of distributed reasoning. 

To overcome these problems, we are developing a new framework for distributed au- 
tomated reasoning based on agent-oriented programming. In our framework the capa- 
bilities of reasoning systems are described in the Mathematical Service Description 
Language (MSDL) [6], an XML language developed in the MONET project [7]. The 
central agents in our framework are brokers that act as middle-agents between the service 
providing agents (the reasoning agents) and the service requesting agents. Our brokers 
will reason on MSDL service descriptions to find suitable sequences of available services 
to tackle a given problem. 

For our framework we use existing standards wherever possible. In particular, we
employ the specifications of the Foundation for Intelligent Physical Agents (FIPA) [8]
for the interaction and coordination of agents. We use the languages OpenMath [9] and
OMDoc [10] for the encoding of mathematical content. Furthermore, we use OMDoc
and the language TSTP [11] for theorem proving problems and proofs. For the definition
of ontologies we use the Web Ontology Language (OWL) [12].

In [13] we described first ideas for our framework. Since then we have made progress
in integrating first-order ATPs as reasoning agents. We can now describe the service
offered by these agents in MSDL using an ontology we have developed. We also im-
plemented a prototypical broker which analyzes a given theorem proving problem and
chooses the best available prover to tackle the problem.

The remainder of this paper is structured as follows: In section 2 we present the 
overall structure of our reasoning agent framework, our ontology for the description of 
theorem proving services, and the description of first-order ATP services with MSDL. 
In section 3 we show an example for advanced brokering of reasoning services. We 
conclude and discuss some related work in section 4. 



2 A Framework for Reasoning Agents 

Our work on the brokering of reasoning services can be split into three major parts:
1) the development of the system of reasoning agents and a brokering mechanism,
2) the development of an ontology for reasoning service descriptions, and 3) MSDL
descriptions of the services needed for our case studies. In this section we describe our
work on these issues.
2.1 The Agent Platform 

As we have already mentioned in section 1, we base our reasoning agents, the commu-
nication between them, and the encoding of mathematical content on existing standards.
The Foundation for Intelligent Physical Agents (FIPA) has produced many specifications
for the inter-operation of heterogeneous software agents. FIPA specifies the agent com-
munication language FIPA-ACL, the communicative acts between agents, and several
interaction protocols for agents, such as the Contract Net Protocol^2. For our work we
employ the Java Agent DEvelopment Framework (JADE) [14], which is a widely used
implementation of the FIPA specifications.

Reasoning agents are specialized JADE agents which encapsulate reasoning systems 
and manage conversations with other reasoning agents. They translate incoming FIPA- 
ACL messages into calls of the underlying reasoning system and wrap results into 
response messages. Reasoning agents advertise MSDL descriptions of their capabilities 
to a local broker which stores all advertised services in its service directory. 
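
The division of labour described above (an agent wraps a reasoner, advertises its service, and answers queries) can be illustrated with a small sketch. This is plain Python and deliberately does not use the JADE API; the classes `Message`, `Broker` and `ReasoningAgent`, and the stand-in reasoner that always answers "Theorem", are assumptions made for illustration only.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class Message:
    performative: str  # e.g. "inform" or "query-if", named after FIPA-ACL acts
    sender: str
    content: object


@dataclass
class Broker:
    # Service directory: service name -> name of the agent offering it.
    service_directory: Dict[str, str] = field(default_factory=dict)

    def receive(self, msg: Message) -> None:
        # An "inform" message is treated here as a service advertisement.
        if msg.performative == "inform":
            self.service_directory[msg.content] = msg.sender


@dataclass
class ReasoningAgent:
    name: str
    service: str
    reasoner: Callable[[str], str]  # stand-in for the wrapped reasoning system

    def advertise(self, broker: Broker) -> None:
        broker.receive(Message("inform", self.name, self.service))

    def handle(self, msg: Message) -> Message:
        # Translate the incoming message into a call of the underlying
        # reasoner and wrap the result into a response message.
        result = self.reasoner(msg.content)
        return Message("inform", self.name, result)


broker = Broker()
agent = ReasoningAgent("ATP Agent 1", "SpassATP", reasoner=lambda problem: "Theorem")
agent.advertise(broker)
reply = agent.handle(Message("query-if", "client", "p -> p"))
print(broker.service_directory, reply.content)
```

In the framework itself the advertisement carries an MSDL description rather than a bare service name; the sketch keeps only the message flow.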




Fig. 1. Reasoning agents informing a central reasoning broker about the services they offer, and
the Proof Assistant Agent Ωmega asking for a proof of a conjecture



Fig. 1 shows a scenario in which several agents inform a local broker about the
reasoning services they offer using the communicative act inform of the FIPA-ACL.
For instance, the agent ATP Agent₁, which encapsulates the theorem prover SPASS,
informs the broker that it offers the first-order theorem proving service SpassATP (see
section 2.3). The proof assistant Ωmega [15] does not offer any service but uses the
query-if performative to send an open conjecture (Proving-Problem($\Gamma \vdash \varphi$)) to the
broker, asking whether the conjecture holds. JADE agents are grouped in agent containers.
We are planning to start one broker in each agent container because of our positive
experience with dynamic networks of brokers [3]. At the moment we only work with
one broker in one agent container. However, in the future, brokers in different agent
containers will connect to each other to exchange information about available services.

^2 These protocols will become interesting in a later stage of our project where we intend to
investigate whether they can be used for the coordination of distributed reasoning agents.

Every agent can send queries, i.e. open reasoning problems, to the broker of their 
agent container. Currently, our broker only accepts first-order theorem proving problems. 
In [16] Sutcliffe and Suttner distinguish six important features of first-order problems
(e.g., whether the problem contains the equality predicate or not). Our broker analyzes
incoming proving problems according to these features and annotates the problem with 
the result of this analysis. Then the annotated problem is matched against the available 
services. Up to now, the broker uses a very simple matching based on the tuProlog 
engine [17]. In the future the broker will employ a more sophisticated reasoning process 
which is capable of combining several services to tackle a problem. This is also going 
to incorporate reasoning on our ontology (cf. section 2.2). In section 3 we describe how 
reasoning on service descriptions could be used to solve the proving problem sent by 
Ωmega (see Fig. 1).
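
The analyze, annotate and match steps can be sketched as follows. This is a simplified illustration under stated assumptions: the problem representation (a map from predicate symbols to arities), the two feature names `equality` and `propositional`, and the second service `EqualityATP` are hypothetical, and the matching is a plain dictionary comparison rather than the tuProlog-based matching used by the actual broker.

```python
def analyze(problem):
    """Annotate a problem with simple syntactic features (hypothetical names)."""
    symbols = problem["predicate_symbols"]  # predicate symbol -> arity
    return {
        "equality": "=" in symbols,
        "propositional": all(arity == 0 for arity in symbols.values()),
    }


# Hypothetical service directory: service name -> required feature values.
DIRECTORY = {
    "SpassATP": {"equality": False, "propositional": True},
    "EqualityATP": {"equality": True},
}


def match(problem, directory):
    """Return the services whose requirements agree with the problem's features."""
    features = analyze(problem)
    return [
        name
        for name, required in directory.items()
        if all(features.get(key) == value for key, value in required.items())
    ]


propositional_problem = {"predicate_symbols": {"p": 0, "q": 0}}
equational_problem = {"predicate_symbols": {"=": 2, "r": 1}}
print(match(propositional_problem, DIRECTORY))  # ['SpassATP']
print(match(equational_problem, DIRECTORY))     # ['EqualityATP']
```

A requirement that is absent from a service entry is simply not checked, so a service can stay agnostic about features it does not care about.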

At first sight, our brokering mechanism seems to be similar to Polya’s four phases 
of problem solving [18]. But, as opposed to Polya, our broker does not have an “under-
standing” of a problem in the sense of an intelligent mathematician. Up to now, it only
finds syntactical features of problems. However, some of Polya’s ideas, such as theorem 
lookup (in a database), the use of analogy, and independent proof checking, are also 
interesting for brokering as it has been described above. 

Different deduction systems typically rely on different logics and consequence rela- 
tions. Therefore, we are investigating the possible use of the LF logical framework, and 
its implementation in the TWELF system [19] as a logical basis for brokers. Different 
logics and calculi can be encoded in TWELF’s type theory and the system offers partial 
transformations of proofs from one deductive system to another. These transformations 
might become useful in future applications of our framework. In particular, a combina- 
tion of LF with the TRAMP tool [20] described in section 3 would be very useful for 
our work and other research projects. 

2.2 An Ontology for Reasoning Service Descriptions 

We are currently developing two ontologies: 1) a brokering ontology which is used in 
service advertisements sent to the broker, and 2) a reasoning ontology which is used in 
MSDL service descriptions and problem descriptions. In this paper we focus on the latter. 
We decided to use the Protégé-2000 tool [21], which has recently added support for the development of
ontologies in OWL. The Protégé tool is particularly useful for our purposes as it allows us
to automatically generate ontology classes for JADE which can be used directly
by our reasoning agents. Due to space limitations, and to preserve readability, we present
only the fragment of the reasoning ontology that is important for this paper. Our ontology 
consists of concepts (classes), slots (attributes), and pre-defined instances of concepts. 
Fig. 2 shows the “is-a” (subclass) relationship between concepts as solid arrows. Slots 
and their cardinality restrictions are denoted with dashed lines. Instances are connected
to their concepts by dotted lines.

[Figure 2 legend: Subconcept, Slot (+Restriction) with cardinality a..b, Instance. Labels recoverable from the diagram: logic (1..1), Theorem, Satisfiable, Unsatisfiable, Calculus.]

Fig. 2. The fragment of an ontology for reasoning services

Fig. 2 contains some of the concepts needed to describe
first-order theorem proving services and their results. One crucial concept for this paper 
(circled in Fig. 2) is the FO-ATP-Result which denotes results of first-order ATPs. A FO- 
ATP-Result can have a time slot which contains an instance of a time resource description 
(Time-Resource), and a proof of a conjecture (an instance of Proof). Most important, 
the state slot of a FO-ATP-Result always contains one of the valid states of first-order 
ATPs that we developed jointly with Geoff Sutcliffe and Stephan Schulz [11]. This state 
defines the prover’s result for the conjecture given. For instance, the state Theorem says 
that the prover has found out that the given conjecture is a theorem of the given axioms. 

2.3 Service Descriptions 

In our framework, reasoning agents offer services specified in the service description 
language MSDL. Up to now, we distinguish three different types of services. Proving and 
computing services are services that solve given reasoning problems. Examples include 
theorem provers that try to prove a given formula, or computer algebra systems that 
simplify terms or solve differential equations. Transformation and translation services 
are used to change the representation of a problem or a result such as, for instance, the 
translation of first-order logic problems into clause normal form (CNF), or the transfor-
mation of a resolution proof into Natural Deduction calculus [20]. Classification services
are services that come up with a refined classification of a given problem. Examples are
services that, given a proving problem, recognize that the problem is essentially a propo-
sitional problem, belongs to the guarded fragment of first-order logic, or consists solely
of Horn clauses. These characteristics of proving problems are important for the choice
of a suitable theorem prover to tackle the problem [16].

The projects MathBroker [22] and MONET [7] intend to offer mathematical services, 
described in MSDL, as web services in the Semantic Web. Although MSDL aims at 
describing all kinds of mathematical services, the two projects have only investigated the 
description of symbolic and numeric computation services. We are using our expertise in 
deduction systems to extend the use of MSDL to deduction services. We started with first- 
order ATP services because they have been successfully used in many applications [3,4]. 
An MSDL document describes many different facets of a reasoning service. We
briefly describe these facets with an example: the MSDL description of the first-order
proving service SpassATP offered by the ATP Agent₁ (shown in Fig. 1). We only present
the most important parts of the rather verbose service description. Due to space limita-
tions we abstract from MSDL’s XML and present the SpassATP service in a table:



Service: SpassATP

classification:          http://www.mathweb.org/proving-services#FO-ATP

problem:
  input parameters:      name: problem, signature: Proving-Problem
  output parameters:     name: result, signature: FO-ATP-Result
  pre-conditions:        essentiallyPropositional(problem) ∧   (OpenMath)
                         noEquality(problem) ∧
                         fofForm(problem)
  post-conditions:       true

service interface:       agent(_, %client, $problem) =
                           query-if($problem) ==> agent(_, %prover) then
                           waitfor inform($result) agent(_, %prover)
                           timeout (e)

implementation details:  Information about hardware, software (implementation)



First, a classification of a service is given via a URI^3 which serves as a reference to a
problem description library or to an existing taxonomy of services. Then, the service
is further classified by the abstract mathematical problem it can tackle. The abstract
problem solved by the service SpassATP expects only one input: a Proving-Problem as 
it is described in the ontology in Fig. 2. The output of the service is a FO-ATP-Result. 
The pre-conditions of the service say that the service should be used if the problem is 
essentially propositional, contains no equality, and it is presented as first-order formulas 
(as opposed to clause normal form). There are no post-conditions on the output of this 
service. 

The service interface provides information on how to access the service. MSDL 
has been designed to describe the semantics of web services. Therefore, the service 
interface typically contains a document written in the Web Service Description Lan- 
guage (WSDL). However, since our agents communicate via FIPA-ACL messages we 
are experimenting with the use of the MAP language [23] to describe the protocol which 
guides the invocation of a service. In case of the SpassATP service this protocol simply 
states that any agent that acts as a client (has the role %client) should send a query-if
message containing the proving problem to the agent providing the service. Then the 
client has to wait for an inform message which contains the result of the proving attempt. 
The timeout “e” indicates that there is no timeout given. 
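
The invocation protocol just described (send query-if, then wait for inform, with no timeout) can be mimicked with two queues and a worker thread. This is only an illustrative sketch using the Python standard library, not MAP or FIPA-ACL; the function names `prover_agent` and `run_client` and the fixed answer string are assumptions.

```python
import queue
import threading


def prover_agent(inbox, outbox):
    """Service side: wait for a query-if message and answer with inform."""
    performative, problem = inbox.get()
    if performative == "query-if":
        # A real agent would call the underlying prover here.
        outbox.put(("inform", f"result for {problem}"))


def run_client(problem, timeout=None):
    """Client side: send query-if, then wait for the inform reply.

    timeout=None blocks indefinitely, mirroring the absence of a timeout
    in the SpassATP protocol.
    """
    to_prover, to_client = queue.Queue(), queue.Queue()
    worker = threading.Thread(target=prover_agent, args=(to_prover, to_client))
    worker.start()
    to_prover.put(("query-if", problem))
    performative, result = to_client.get(timeout=timeout)
    worker.join()
    if performative != "inform":
        raise RuntimeError(f"unexpected performative: {performative}")
    return result


print(run_client("p -> p"))
```

Passing a number of seconds as `timeout` would make the client give up with `queue.Empty` instead of waiting forever.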

Finally, the implementation details of the service contain information about the 
underlying reasoning system, the hardware the service is running on, etc. 



^3 Note that Uniform Resource Identifiers (URIs), as opposed to URLs, do not necessarily point
to existing web resources.
3 Example for Advanced Brokering 

Currently, our broker is limited to first-order theorem proving problems and it has only
basic brokering capabilities. However, we plan to extend our broker to also find sequences
of services that might solve a problem, in case a single service is not sufficient. In this
section, we describe a scenario in which this advanced form of brokering is needed.

Typically, the user of a proof assistant like Ωmega [15] has to tackle many sub-
problems occurring in a large proof development. The user may therefore be interested
in asking a broker within our reasoning service network to tackle some of these prob-
lems. More specifically, we now assume that the subproblem consists of a set of (local)
proof assumptions $\Gamma$ and a conclusion $\varphi$ in order-sorted type theory, the logic underlying
Ωmega. The user or the proof assistant could, for instance, be interested in the following
query to a broker:

Query: Given $\Gamma$ and $\varphi$, determine whether $\varphi$ is a logical consequence of $\Gamma$ in order-
sorted type theory. If so, find a Natural Deduction (ND) derivation (proof object) of
$\Gamma \vdash \varphi$. We denote this query with the sequent $\Gamma \vdash^{P(ND)?} \varphi$, where $P(ND)?$ means
that the user is asking for a proof object in ND calculus.

For our example, we also assume that some reasoning agents have already advertised 
descriptions of the following reasoning services to the broker: 

FO-ATP₁, FO-ATP₂: Classical resolution-based first-order theorem proving services as described
in section 2.3.

HO-ATP: A higher-order proving service offered, e.g., by the theorem prover LEO. It
takes a Proving-Problem (in type theory) and delivers a proof in ND calculus (HO-
ND-Proof).

HO2FO: Transforms, if possible, problems in type theory into first-order logic problems.
Such a service is, for instance, implemented in the Ωmega system.

FOP2ND: Transforms a first-order resolution proof into an ND proof. This service is, for
instance, offered by the TRAMP system [20].

The broker uses the advertised services in its service directory and the service planner
to find possible sequences of services that may answer Ωmega’s query. Fig. 3 shows all
promising sequences in a disjunctive tree. The problem might be solved directly by the
available higher-order theorem prover. It might also essentially be a first-order problem
and is therefore transformable using HO2FO. After this transformation, the problem can
be sent to one of the available first-order ATPs. The resulting FO-ATP-Results can be
used as input for the service FOP2ND. This service will produce an ND proof in case
FO-ATP₁ or FO-ATP₂ could find a resolution proof.
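
The search for service sequences can be illustrated by chaining services on their input and output types. This is a minimal sketch, not the planned service planner: the type names (`HO-Problem`, `FO-Problem`, `ND-Proof`) are simplifications introduced here, and the function `find_plans` with its depth bound is an assumption.

```python
# Each advertised service is reduced to (input type, output type).
SERVICES = {
    "HO-ATP": ("HO-Problem", "ND-Proof"),
    "HO2FO": ("HO-Problem", "FO-Problem"),
    "FO-ATP1": ("FO-Problem", "FO-ATP-Result"),
    "FO-ATP2": ("FO-Problem", "FO-ATP-Result"),
    "FOP2ND": ("FO-ATP-Result", "ND-Proof"),
}


def find_plans(start, goal, services, max_len=4):
    """Enumerate service sequences that turn `start` into `goal`."""
    plans = []

    def extend(current, path):
        if current == goal:
            plans.append(path)
            return
        if len(path) == max_len:
            return
        for name, (source, target) in services.items():
            # Chain a service when its input type matches the current data.
            if source == current and name not in path:
                extend(target, path + [name])

    extend(start, [])
    return plans


plans = find_plans("HO-Problem", "ND-Proof", SERVICES)
for plan in plans:
    print(" -> ".join(plan))
```

For the services listed above the search yields three branches: the direct higher-order proof attempt, and the two detours through HO2FO, a first-order prover, and FOP2ND.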

After plan formation the broker could first try to execute the first branch of a plan. 
If a service application in this branch fails, the broker should try another branch as an 
alternative solution path. Disjunctive plans could also be used to model parallel service 
invocations either to obtain a second, independent result for a problem, or to increase 
the overall performance of the system. 
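
Sequential execution with fallback to an alternative branch can be sketched as below. The function `execute_plan`, the stub `run_service` and the use of `RuntimeError` to signal a failed service application are assumptions for illustration; parallel invocation of branches is not shown.

```python
def execute_plan(branches, run_service, problem):
    """Try each branch in order; return (branch, result) for the first success."""
    for branch in branches:
        data = problem
        try:
            for service in branch:
                data = run_service(service, data)
        except RuntimeError:
            # A service application in this branch failed: try the next branch.
            continue
        return branch, data
    return None


def run_service(service, data):
    """Stub executor: the higher-order prover fails, everything else succeeds."""
    if service == "HO-ATP":
        raise RuntimeError("no proof found")
    return f"{service}({data})"


branches = [["HO-ATP"], ["HO2FO", "FO-ATP1", "FOP2ND"]]
outcome = execute_plan(branches, run_service, "problem")
print(outcome)
```

When every branch fails the function returns `None`, which a broker could report back to the requesting agent as an unanswered query.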
[Figure 3: disjunctive tree of service sequences; labels recoverable from the diagram: FOP2ND, ND proof.]

Fig. 3. Possible sequences of service calls to answer the query of Ωmega



4 Conclusions and Related Work 

We have presented a new framework for reasoning agents that we are developing on 
top of the JADE agent platform. Our framework is based on formal descriptions of 
mathematical services and problems in MSDL. Using our new framework we can already 
overcome some of the problems we faced with the MathWeb Software Bus. So far, we 
have implemented first-order theorem proving agents and a prototypical broker in JADE. 
Furthermore, we have developed the ontology needed to describe the services of first- 
order ATPs in MSDL. We propose the employment of a plan-based brokering mechanism 
which finds suitable sequences of services to solve a given problem. In the near future, 
we are going to implement this brokering mechanism. However, we are not yet sure 
which reasoning technique is most suitable for reasoning on service descriptions. Future
case studies in distributed reasoning might require the flexibility of planning techniques 
similar to proof planning. But it might also be the case that much simpler techniques, 
such as lookup-tables or production systems, are sufficient. 

Furthermore, we might extend our framework by integrating more services, such as 
transformation tools and proof checkers. We are also going to investigate how much of 
our work on first-order ATPs can be used to describe higher-order ATPs. We also intend 
to integrate the work of the MathBroker project on symbolic computation services. 
However, for the integration of computation services with deduction systems (e.g., as 
described in [5]) a more centralized approach, such as in the Ωmega system, seems to
be more appropriate than the decentralized approach presented here. 

Franke and others have already motivated the use of agent-oriented programming 
and the agent communication language KQML for the integration of distributed mathe- 
matical services [24]. However, their ideas have never been realized in an actual system. 
We think that our work goes further than the ideas presented in [24] for two reasons: 
1) The core of our framework will be the brokering of reasoning services described in
a formal service description language, and 2) we generally use state of the art Internet 
and multi-agent standards wherever possible. 

A sound combination of deduction systems is particularly difficult because different 
systems are based on different logics and consequence relations. At the current stage 
of our project, we are not addressing this problem. However, the above-mentioned LF 
logical framework might be a suitable meta-logic for our broker. 

The Logic Broker Architecture (LBA) [25] is a system (quite similar to the Math- 
Web-SB) which also aims at a sound integration of logic services using a logic service 
matcher and logic morphisms. The question of how different theorem provers can be
easily combined in a single environment has led to the concept of Open Mechanized 
Reasoning Systems [26]. In OMRS, a mathematical software system is described in 
three layers: the logic layer, the control layer, and the interaction layer. 

The Semantic Web community [27] aims at developing languages and tools for 
annotating web resources, such as web pages and online services, with semantic markup. 
The goal is to develop better search engines for the web and registries for web services. 
The ontology languages OWL and OWL-S are first outcomes of the Semantic Web 
initiative. 



Acknowledgements. Many thanks to Christoph Benzmüller, Serge Autexier, Geoff
References 

1. Homann, K., Calmet, J.: Combining Theorem Proving and Symbolic Mathematical Comput-
ing. In Calmet, J., Campbell, J.A., eds.: Integrating Symbolic Mathematical Computation and
Artificial Intelligence; Proc. of the Second International Conference. Volume 958 of LNCS,
Springer Verlag (1995) 18-29

2. Harrison, J., Théry, L.: A Skeptic’s Approach to Combining HOL and Maple. Journal of
Automated Reasoning 21 (1998) 279-294

3. Zimmer, J., Kohlhase, M.: System Description: The MathWeb Software Bus for Distributed
Mathematical Reasoning. [28] 139-143

4. Zimmer, J., Franke, A., Colton, S., Sutcliffe, G.: Integrating HR and tptp2x into MathWeb
to Compare Automated Theorem Provers. In: Proc. of PaPS’02 Workshop. DIKU Technical
Report 02-10. Department of Computer Science, University of Copenhagen (2002) 4-18

5. Melis, E., Zimmer, J., Müller, T.: Integrating Constraint Solving into Proof Planning. In
Kirchner, H., Ringeissen, C., eds.: Frontiers of Combining Systems - Third International
Workshop. Volume 1794 of LNAI, Springer (2000) 32-46

6. Dewar, M., Carlisle, D., Caprotti, O.: Description Schemes For Mathematical Web Services.
In: Proc. of EuroWeb 2002 Conference: The Web and the GRID, St Anne’s College Oxford,
British Computer Society Electronic Workshops in Computing (eWiC) (2002)

7. Consortium, T.M.: The MONET Project, http://monet.nag.co.uk/cocoon/monet/index.html
(2002)

8. FIPA: The Foundation for Intelligent Physical Agents Specifications, http://www.fipa.org/
(2003)

9. Caprotti, O., Cohen, A.M.: Draft of the OpenMath standard. The OpenMath Society,
http://www.nag.co.uk/projects/OpenMath/omstd/ (1998)




A Framework for Agent-Based Brokering of Reasoning Services 221 



10. Kohlhase, M.: OMDoc: Towards an Internet Standard for the Administration, Distribution
and Teaching of Mathematical Knowledge. In: Proc. AISC’2000. (2000)

11. Sutcliffe, G., Zimmer, J., Schulz, S.: Communication Formalisms for Automated Theorem
Proving Tools. In Sorge, V., Colton, S., Fisher, M., Gow, J., eds.: Proc. of the Workshop on
Agents and Automated Reasoning, 18th International Joint Conference on Artificial Intelli-
gence. (2003)

12. Smith, M.K., Welty, C., McGuinness, D.L.: Web Ontology Language (2003) Available at
http://www.w3.org/TR/owl-guide/.

13. Zimmer, J.: A New Framework for Reasoning Agents. In Sorge, V., Colton, S., Fisher, M.,
Gow, J., eds.: Proc. of the Workshop on Agents and Automated Reasoning, 18th International
Joint Conference on Artificial Intelligence. (2003)

14. Telecom Italia Lab: Java Agent DEvelopment Framework (JADE). Available at
http://sharon.cselt.it/projects/jade/ (2003)

15. Siekmann, J., et al.: Proof Development with ΩMEGA. [28] 144-149

16. Sutcliffe, G., Suttner, C.: Evaluating General Purpose Automated Theorem Proving Systems.
Artificial Intelligence 131 (2001) 39-54

17. Denti, E., Omicini, A., Ricci, A.: tuProlog: A light-weight Prolog for internet applications
and infrastructures. In Ramakrishnan, I., ed.: Practical Aspects of Declarative Languages
(PADL’01). Number 869 in LNCS (2001)

18. Polya, G.: How to Solve it. Princeton University Press, Princeton, NJ (1945)

19. Pfenning, F., Schürmann, C.: System description: Twelf - a meta-logical framework for
deductive systems. In: Proc. of the 16th International Conference on Automated Deduction.
Number 1632 in LNAI, Springer (1999) 202-206

20. Meier, A.: System description: Tramp: Transformation of machine-found proofs into ND-
proofs at the assertion level. In McAllester, D., ed.: Automated Deduction - CADE-17.
Number 1831 in LNAI, Springer Verlag (2000) 460-464

21. Noy, N.F., Sintek, M., Decker, S., Crubézy, M., Fergerson, R.W., Musen, M.A.: Creating
Semantic Web Contents with Protégé-2000. IEEE Intelligent Systems 2 (2001) 60-71

22. Schreiner, W., Caprotti, O.: The MathBroker Project.
http://poseidon.risc.uni-linz.ac.at:8080/index.html (2001)

23. Walton, C.: Multi-Agent Dialogue Protocols. In: Proc. of the Eighth International Symposium
on Artificial Intelligence and Mathematics, Fort Lauderdale, Florida (2004)

24. Franke, A., Hess, S.M., Jung, C.G., Kohlhase, M., Sorge, V.: Agent-oriented Integration of
Distributed Mathematical Services. J. of Universal Computer Science 5 (1999) 156-187

25. Armando, A., Zini, D.: Towards Interoperable Mechanized Reasoning Systems: the Logic
Broker Architecture. In Poggi, A., ed.: Proc. of the AI*IA-TABOO Joint Workshop ‘From
Objects to Agents: Evolutionary Trends of Software Systems’, Parma, Italy (2000)

26. Giunchiglia, F., Pecchiari, P., Talcott, C.: Reasoning Theories - Towards an Architecture for
Open Mechanized Reasoning Systems. In Baader, F., Schulz, K., eds.: Frontiers of Combining
Systems. Volume 3 of Applied Logic Series, Kluwer, Netherlands (1996) 157-174

27. Berners-Lee, T., Hendler, J., Lassila, O.: The Semantic Web. Scientific American 284(5)
(2001) 34-43

28. Voronkov, A., ed.: Proc. of the 18th International Conference on Automated Deduction.
Number 2392 in LNAI, Springer Verlag (2002)




Faster Proximity Searching in Metric Data* 



Edgar Chavez¹ and Karina Figueroa¹,² 

¹ Universidad Michoacana, Mexico, 
{elchavez, karina}@fismat.umich.mx 
² DCC Universidad de Chile, Chile 



Abstract. A number of problems in computer science can be solved 
efficiently with the so-called memory-based or kernel methods. Among 
these problems (relevant to the AI community) are multimedia indexing, 
clustering, unsupervised learning and recommendation systems. The 
common ground of these problems is satisfying proximity queries on an 
abstract metric database. 

In this paper we introduce a new technique for building practical indexes 
for metric range queries. This technique improves existing algorithms 
based on pivots and signatures, and introduces a new data structure, 
the Fixed Queries Trie, to speed up metric range queries. The result is 
an index with O(n) construction time and query complexity O(n^α), α < 1. 
The indexing algorithm uses only a few bits of storage for each database 
element. 



1 Introduction and Related Work 

Proximity queries are extensions of exact searching in which we want to 
retrieve objects from a database that are close to a given query object. The query 
object is not necessarily a database element. The concept can be formalized using 
the metric space model, where a distance function d(x, y) is defined for every pair 
of sites in a set X. The distance function d has metric properties, i.e. it satisfies 
d(x, y) ≥ 0 (positiveness), d(x, y) = d(y, x) (symmetry), d(x, y) = 0 iff x = y (strict 
positiveness), and the property allowing the existence of solutions better than 
brute force for similarity queries: d(x, y) ≤ d(x, z) + d(z, y) (triangle inequality). 

The database is a set U ⊆ X, and we define the query element as q, an 
arbitrary element of X. A similarity query involves additional information 
besides q, and can be of two basic types: 

Metric range query: (q, r)_d = {u ∈ U : d(q, u) ≤ r}. 

K nearest neighbor query: nn_k(q)_d = {u_i ∈ U : ∀v ∈ U, d(q, u_i) ≤ d(q, v) and |{u_i}| = k}. 
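
As a baseline for the two query types, the following is a minimal brute-force sketch in Python. It is not part of the paper; the function names and the toy word list are illustrative assumptions. The edit distance is used because it is the metric of the experiments reported later. Every index discussed in the paper aims at answering the same queries with fewer than n distance evaluations.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (a metric on strings)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution or match
        prev = cur
    return prev[-1]


def range_query(db, q, r, dist):
    """Metric range query: every u in db with dist(q, u) <= r (n evaluations)."""
    return [u for u in db if dist(q, u) <= r]


def knn_query(db, q, k, dist):
    """K nearest neighbor query: the k elements closest to q (ties by db order)."""
    return sorted(db, key=lambda u: dist(q, u))[:k]


words = ["cat", "cart", "card", "dog", "cot"]   # toy database (assumption)
print(range_query(words, "cat", 1, edit_distance))
print(knn_query(words, "cat", 2, edit_distance))
```
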

The problem has received a lot of attention in recent times, due to an 
increasing interest in indexing multimedia data coming from the web. For a 
detailed description of recent trends in what is also called distance-based 
indexing the reader should see [6]. If the objects are vectors (with coordinates), 
then a recent kd-tree improvement can be used [7]; we are interested in the rather 
more general case of non-vectorial data. 

1.1 Related Work 

There are two basic paradigms for distance-based indexing, pivot-based 
algorithms and local partition algorithms, as described in [6]. There the authors 
state the general idea of local partition algorithms, which is to build a 
locality-preserving hierarchy and then to map the hierarchy levels to a tree. Pivoting 
algorithms, on the other hand, are based on a mapping to a vector space using 
the distances to a set of distinguished sites in the metric space. Since our 
algorithm is also pivot-based, we concentrate on this family of algorithms. 

An abstract view of a pivot-based algorithm is as follows. We select a set of l 
pivots {p_1, ..., p_l}. At indexing time, for each database element a, we compute 
and store Φ(a) = (d(a, p_1), ..., d(a, p_l)). At query time, for a query (q, r)_d, we 
compute Φ(q) = (d(q, p_1), ..., d(q, p_l)). Now, we can discard every a ∈ U such that, 
for some pivot p_i, |d(q, p_i) − d(a, p_i)| > r, or which is the same, we discard every 
a such that 

    max_{1 ≤ i ≤ l} |d(q, p_i) − d(a, p_i)| = L_∞(Φ(a), Φ(q)) > r. 

The underlying idea of pivot-based algorithms is to project the original metric 
space into a vector space with a contractive mapping. We search in the new 
space with the same radius r, which guarantees that no answer will be missed. 
There is, however, the chance of selecting elements that should not be in the 
query outcome. These false positives are filtered using the original distance. The 
more pivots used, the more accurate the mapping is and the closer the number of 
distance computations is to the number of elements in the query ball. The indexing 
algorithms differ in how the search in the mapped space is implemented. A naive 
solution is to search exhaustively in the mapped space [9], or to use a generic 
spatial access method like the R-tree [8]. Both of these solutions are acceptable 
if the dimension of the mapped space is kept low, but can be worse if the dimension 
of the mapped space is high. 
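
The naive variant described above (exhaustive search in the mapped space, followed by verification with the original distance) can be sketched as follows. This is an illustrative Python sketch, not the authors' implementation; the function names, the 2D points and the choice of pivots are assumptions, and the Euclidean distance stands in for an arbitrary metric.

```python
import math


def build_pivot_table(db, pivots, dist):
    """Indexing time: store Phi(a) = (d(a, p_1), ..., d(a, p_l)) for each a."""
    return [[dist(a, p) for p in pivots] for a in db]


def pivot_range_query(db, table, pivots, q, r, dist):
    """Return (answers, verified): the query outcome and the number of
    candidates that had to be checked with the original distance."""
    phi_q = [dist(q, p) for p in pivots]        # l distance evaluations
    answers, verified = [], 0
    for a, phi_a in zip(db, table):
        # L-infinity distance in the mapped space never exceeds d(q, a),
        # so exceeding r here means a cannot be in the query outcome.
        if max(abs(x - y) for x, y in zip(phi_q, phi_a)) > r:
            continue
        verified += 1
        if dist(q, a) <= r:                      # filter out false positives
            answers.append(a)
    return answers, verified


points = [(0, 0), (1, 1), (5, 5), (9, 0), (2, 0)]   # toy database (assumption)
pivots = [(0, 0), (9, 0)]                           # arbitrary pivots (assumption)
table = build_pivot_table(points, pivots, math.dist)
answers, verified = pivot_range_query(points, table, pivots, (1, 0), 1.5, math.dist)
print(answers, verified)
```

In this toy run the two distant points are discarded without computing their distance to the query, so only the surviving candidates cost a distance evaluation beyond the distances to the pivots.
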

A natural choice for measuring the complexity of a proximity query in metric 
spaces is the number of distance computations, since this operation dominates 
the cost. The distance computations are in turn divided into inner and outer 
complexity; the latter is the size of the candidate list and the former the number 
of distances to the pivots. A more realistic setup must include what are called side 
computations, that is, the cost of searching in the tree (or data structure in general) 
to collect the list of candidates. 

Some of the pivoting algorithms [2, 4, 10] have no control over the number of 
pivots used. If we want effective control in reducing the size of the 
candidate list we must have arbitrarily taller trees, or a fixed number of pivots 
in the mapping. This is the approach of the Fixed Height Fixed Queries Tree 
(FHQT) [1], where the height of the tree is precisely the number of pivots, or the 
dimension of the mapped space. A serious drawback of the FHQT is the amount 
of memory used, because even if a branch in the tree collapses to a single path, 
for each additional node we must save the distance and some additional node 
information. 

Actual instances of datasets show that the optimum number of pivots (balancing 
the external and internal complexity) cannot be reached in practice. For 
example, when indexing a dictionary of 1,000,000 English words we can use as 
many as 512 pivots without increasing the number of distance computations. 

The Fixed Queries Array [5] showed a dramatic improvement (reduction) in 
the size of the index, obtained by completely eliminating the tree and working 
only with an ordered array representing the tree. In this approach every tree 
traversal can be simulated using binary search, hence adding a log n penalty in 
the side computations but notably decreasing the external complexity, or the size 
of the candidate list. 

In this paper we will show a further improvement on the FQA algorithm, elim- 
inating the logarithmic factor in the searching. This will be done by designing a 
way to manipulate whole computer words, instead of fetching bytes inside them. 

1.2 Basic Terminology 

The set of pivots K = {p_1, ..., p_k} is a subset of the metric space X. 

A discretization rule is an injective function δ_p : ℝ⁺ × K → {0, ..., 2^{b_p} − 1} 
mapping positive real numbers into 2^{b_p} discrete values. The discretization 
rule depends on the particular pivot p. The preimage of the function defines a 
partition of ℝ⁺. We assume δ_p(r) will deliver a binary string of size b_p. Defining 
the discretization rule as pivot-dependent allows us to emphasize the importance of 
a particular pivot. For simplicity the same rule may be applied to all the pivots. 

A signature function for objects will be a mapping δ* : X → {0, 1}^m, with 
m = b_{p_1} + ··· + b_{p_k}, given by the rule δ*(o) = δ_{p_1}(d(o, p_1)) ··· δ_{p_k}(d(o, p_k)). 

An infinite number of discretization rules may be defined for b bits. Some of 
them will lead to better filtering methods and some of them will not filter at all. 
The definition of the discretization rule is not essential for the correctness of the 
filtering method, but for its efficiency. We can generalize the signature function 
to intervals, in the same way a function is extended from points to sets. 
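
To make the object signature concrete, the following Python sketch uses one possible discretization rule: equal-width slices between a minimum and a maximum distance, with the same number of bits for every pivot. The function names, the bounds lo and hi, and the one-dimensional toy data are assumptions made for illustration only; the paper does not prescribe a rule.

```python
def discretize(r, lo, hi, bits):
    """Equal-width rule: map a distance r in [lo, hi] to a code in
    {0, ..., 2**bits - 1}; values outside the range are clamped."""
    slices = 1 << bits
    idx = int((r - lo) / (hi - lo) * slices)
    return max(0, min(slices - 1, idx))


def signature(obj, pivots, dist, lo, hi, bits):
    """Concatenate the per-pivot codes into one integer of
    len(pivots) * bits bits (first pivot in the most significant bits)."""
    sig = 0
    for p in pivots:
        sig = (sig << bits) | discretize(dist(obj, p), lo, hi, bits)
    return sig


dist = lambda x, y: abs(x - y)      # toy metric on numbers (assumption)
pivots = [0, 10]
sig = signature(5, pivots, dist, 0.0, 16.0, 2)
print(format(sig, "04b"))           # two pivots, two bits each
```

With two bits per pivot and distances in [0, 16], each slice is 4 units wide, so an object at distance 5 from both pivots receives code 1 for each of them.
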

A discretization rule for intervals is denoted δ_p([r_1, r_2]) = {δ_p(r) | r ∈ 
[r_1, r_2]}. Formally δ_p is defined for positive real numbers. This definition may be 
extended to the domain of real values, assuming δ([r_1, r_2]) = δ([0, r_2]) if r_1 < 0. 
A signature function for queries will be a function δ* defined on X × ℝ⁺, 
mapping queries (q, r)_d into a signature set. The signature of a query will be 
given by the following expression 

    δ*((q, r)_d) = {δ_{p_1}([d(q, p_1) − r, d(q, p_1) + r])}        (1) 
                   {δ_{p_2}([d(q, p_2) − r, d(q, p_2) + r])} ···    (2) 
                   {δ_{p_k}([d(q, p_k) − r, d(q, p_k) + r])}        (3) 

which is an ordered concatenation of the signature sets for the corresponding 
intervals in each pivot. 

Claim 1. If an object o ∈ X satisfies a query (q, r)_d, then δ*(o) ∈ δ*((q, r)_d). 
To prove the above claim, it is enough to observe that δ*((q, r)_d) is an 
extension of δ*(·), and hence o ∈ (q, r)_d implies δ*(o) ∈ δ*((q, r)_d). 

The candidate list is the set [q, r]_d = {o | δ*(o) ∈ δ*((q, r)_d)}. Note that the 
candidate list is a superset of the query outcome, in other words, (q, r)_d ⊆ [q, r]_d. 

Remark 1. Computing the candidate list [q, r]_d implies only a constant number 
of distance evaluations, namely the number of pivots. 

Remark 2. The candidate list asymptotically approaches the size of the query 
outcome as the number of pivots increases. This fact has been proved in [6], but it 
is easily verified by observing that (q, r)_d ⊆ [q, r]_d and that increasing the number 
of pivots never increases the size of [q, r]_d. For any finite sample U of X, if we 
take all the elements of U as pivots, then (q, r)_d = [q, r]_d provided (q, r)_d ⊆ U. 
This proves the assertion. 

Remark 3. Increasing the number of pivots increases the number of distance 
evaluations in computing δ*((q, r)_d). This is trivially true, since computing δ*(·) 
implies k distance computations. If we take as pivots all of U, then we are 
exchanging one exhaustive search for another. 
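
Claim 1 and the candidate list can be illustrated with a small Python sketch. It assumes an equal-width discretization rule (one possible choice, not mandated by the paper), a one-dimensional toy database, and hypothetical function names. Because that rule is monotone, the codes hit by an interval are exactly those between the codes of its endpoints.

```python
def discretize(r, lo, hi, bits):
    """Equal-width rule mapping a distance to a code in {0, ..., 2**bits - 1}."""
    slices = 1 << bits
    idx = int((r - lo) / (hi - lo) * slices)
    return max(0, min(slices - 1, idx))


def query_code_sets(q, r, pivots, dist, lo, hi, bits):
    """For each pivot p, the set of codes hit by [d(q, p) - r, d(q, p) + r]."""
    sets = []
    for p in pivots:
        d = dist(q, p)
        first = discretize(max(d - r, 0.0), lo, hi, bits)  # interval clipped at 0
        last = discretize(d + r, lo, hi, bits)
        sets.append(set(range(first, last + 1)))           # monotone rule
    return sets


def candidate_list(db, codes, sets):
    """Objects whose code for every pivot lies in the query set of that pivot."""
    return [o for o, c in zip(db, codes)
            if all(x in s for x, s in zip(c, sets))]


dist = lambda x, y: abs(x - y)            # toy metric on numbers (assumption)
db = [1, 4, 5, 7, 13, 15]
pivots, lo, hi, bits = [0, 10], 0.0, 16.0, 2
# Index: per-object codes, computed once at indexing time.
codes = [[discretize(dist(o, p), lo, hi, bits) for p in pivots] for o in db]
sets = query_code_sets(6, 1, pivots, dist, lo, hi, bits)
candidates = candidate_list(db, codes, sets)
outcome = [o for o in candidates if dist(6, o) <= 1]       # verification step
print(candidates, outcome)
```

In this run the object 4 survives the signature filter although it is at distance 2 from the query: it is a false positive that the final verification removes, while no true answer is lost.
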

2 The Index 

The goal of an indexing algorithm is to obtain the set (q, r)_d using as few distance 
computations as possible. The indexing algorithm consists in preprocessing the 
data set (a finite sample U of the metric space X) to speed up the querying 
process. Let us define some additional notation to formally describe the problem 
and the proposed solution. 

The index of U, denoted as U*, is the set of all signatures of elements of U: 
U* = δ*(U) = {δ*(o) | o ∈ U}. 

Remark 4. |U*| ≤ |U|. The repeated signatures are factored out and all the 
matching objects are allocated in the same bucket. 

With this notation [q, r]_d may be computed as [q, r]_d = δ*(U) ∩ δ*((q, r)_d). 
To satisfy a query we exhaustively search over [q, r]_d to discard false positives. 
The complete algorithm is described in figure 1. 

We assume a blind use of δ_p(·), the discretization function (optimizing δ_p(·) 
has been studied empirically in [5]), since we are interested in a fast computation 
of [q, r]_d. Nevertheless a fair choice is to divide the interval from the minimum to 
the maximum distance into 2^b equally spaced slices. 



2.1 Lookup Tables 

The core of the searching problem with signatures consists in computing [q, r]_d. 
Observe that there are exponentially many elements in δ*((q, r)_d). The signatures 
are obtained as an ordered concatenation of signatures for each pivot. If each 
pivot p_i produces l_{p_i} signatures for its range, then we will have up to 
l_{p_1} · l_{p_2} ··· l_{p_k} signatures. For example, for 32 pivots generating as few as 
2 signatures each for a query, the number of signatures is 2^32. Instead we can split 
each signature into t computer words. A query signature may be represented as t 
arrays with at most 2^w elements each. A signature a_1 ··· a_m ∈ {0, 1}^m is split into 
t binary strings of size w, of the form A_1 ··· A_t, with each A_i a computer word. 
Each A_i is called a coordinate of a signature. A query signature will be represented 
as t sets of coordinates. 


Generic Metric Range Search 

(X, d) the metric space, U is a database (a sample of X) of size n, 
K = {p_1, ..., p_k} is the set of pivots, k = |K|, b_{p_i} the number of 
bits for each pivot, δ(·) the discretization rule, q ∈ X the query object 

Startup: (U, K, {b_{p_i}}, δ(·)) 
1. Compute U* 

Search: (q, r) 
2. Compute [q, r]_d 
   for each o ∈ [q, r]_d do 
      if d(q, o) ≤ r 
         (q, r)_d ← o 
      fi 
   od 

Fig. 1. A generic indexing algorithm based on signatures. The index is U*, the objective 
is to compute (q, r)_d given q and r. 


A lookup table for query signatures is an array L_j[] of 2^w booleans, 1 ≤ 
j ≤ t. L_j[i] = true if and only if i appears in the j-th set of coordinates. Note 
that computing L_j[] can be done in constant time (at most 2^w). 

Remark 5. We can decide if a particular signature is in the signature of the 
query by evaluating the boolean AND expression L_1[A_1] ∧ ··· ∧ L_t[A_t], which 
takes at most t table fetches for an evaluation; on average we may use fewer 
than t operations. 

2.2 Sequential Scan 

Remark 5 leads to a direct improvement of the sequential scan (FQS). We will 
compare the straight sequential scan with two better alternatives. If the query 
(q, r)_d has low selectivity, the size of the candidate list [q, r]_d will be very large, 
and a clever (sublinear) algorithm for finding the candidate list will be time consuming. 
Figure 2 illustrates the procedure to obtain the candidate list. 
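
The lookup tables and the sequential scan can be sketched in Python as follows. This is a sketch under stated assumptions rather than the authors' code: every pivot uses the same number of bits, a fixed number of pivot codes is packed into each word, signatures are stored as tuples of words, and all names are hypothetical.

```python
from itertools import product


def pack(codes, bits):
    """Pack several pivot codes into one word (first code most significant)."""
    word = 0
    for c in codes:
        word = (word << bits) | c
    return word


def build_lookup_tables(code_sets, bits, per_word):
    """One boolean table per word: table[i] is True iff word value i can be
    formed from the query code sets of the pivots stored in that word."""
    tables = []
    for j in range(0, len(code_sets), per_word):
        group = code_sets[j:j + per_word]
        table = [False] * (1 << (bits * len(group)))
        for combo in product(*group):
            table[pack(combo, bits)] = True
        tables.append(table)
    return tables


def sequential_scan(signatures, tables):
    """FQS-style scan: indices of signatures whose every coordinate hits its
    table; all() stops at the first miss, so fewer than t fetches are common."""
    return [i for i, sig in enumerate(signatures)
            if all(tab[word] for tab, word in zip(tables, sig))]


# Four pivots, 2 bits each, two pivots per word: t = 2 words of w = 4 bits.
code_sets = [{1}, {0, 1}, {2, 3}, {0}]              # hypothetical query sets
tables = build_lookup_tables(code_sets, bits=2, per_word=2)
signatures = [(5, 8), (4, 13), (6, 12), (4, 12)]    # hypothetical index
print(sequential_scan(signatures, tables))
```

The tables replace the explicit enumeration of all query signatures: here the query has four signatures in total, but only two small tables of 16 booleans are built.
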

Compute [q, r]_d using Sequential Scan (FQS) 

U* is the signature set, L_j[] is the lookup table for the query (q, r)_d 
1. Compute [q, r]_d 
   for each signature s = A_1 ··· A_t ∈ U* do 
      for j = 1 to t do 
         if not L_j[A_j] 
            next s 
         fi 
      od 
      [q, r]_d ← Object(s) 
   od 

Fig. 2. A sequential scan to compute [q, r]_d. The complexity is linear in the size of 
the database U. If the query has low selectivity the sequential scan beats a clever 
(sublinear) approach. 


2.3 The Fixed Queries Array with Lookup Tables 

The Fixed Queries Array (FQA) was proposed in [5] to increase the filtering 
power of a pivoting algorithm. Unlike the sequential scan, the approach is sublinear. 
The general idea of the FQA is to keep an ordered array of the signatures. For the 
first pivot the query signature will be an interval in the array of signatures. 
This interval may be found using binary search with an appropriate mask. The 
process is repeated recursively for the successive pivots, which will appear as 
sub-intervals. The FQA may be improved using the lookup tables, allowing whole 
word comparisons (which is faster than repeatedly fetching a few bits inside a word). 

Given a query (q, r)_d, the signature vector for the i-th coordinate will be 
denoted as A_i[]. Figure 3 describes the recursive procedure to compute [q, r]_d. 
The FQA implementation with lookup tables is faster in practice than the plain 
algorithm in [5], and has the same theoretical O(log n) penalty. 
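
The recursive narrowing of intervals in a sorted signature array can be sketched in Python with the standard bisect module. This is a simplified illustration, not the FQA implementation of [5]: signatures are stored as tuples of integer coordinates instead of packed bit strings, the allowed coordinates are given as sets, and the names are hypothetical.

```python
from bisect import bisect_left


def fqa_search(sigs, coord_sets):
    """sigs: sorted list of equal-length tuples of integer coordinates.
    coord_sets[j]: allowed values for coordinate j.
    Returns the positions in sigs of all matching signatures."""
    t = len(coord_sets)
    hits = []

    def recurse(prefix, lo, hi):
        j = len(prefix)
        if j == t:                       # every coordinate matched
            hits.extend(range(lo, hi))
            return
        for a in sorted(coord_sets[j]):
            # Sub-interval of [lo, hi) whose j-th coordinate equals a,
            # found with two binary searches.
            left = bisect_left(sigs, prefix + (a,), lo, hi)
            right = bisect_left(sigs, prefix + (a + 1,), left, hi)
            if left < right:
                recurse(prefix + (a,), left, right)

    recurse((), 0, len(sigs))
    return hits


sigs = sorted([(5, 8), (4, 13), (6, 12), (4, 12), (5, 12)])   # hypothetical index
coord_sets = [{4, 5}, {8, 12}]                                 # hypothetical query
print([sigs[i] for i in fqa_search(sigs, coord_sets)])
```

Each recursion level costs two binary searches per allowed coordinate value, which is the source of the logarithmic penalty mentioned above.
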



Compute [q, r]_d using FQA 

U* is the signature set, which is ordered by coordinates. 
{A_i[]} are the signature vectors of query (q, r)_d 
Compute [q, r]_d 

1. function FQA(int j, V) 
      if j = t then 
         for each A ∈ A_t[] do 
            [q, r]_d ← Object(Select(V, A, t)) 
         od 
         return 
      fi 
      for each A ∈ A_j[] do 
         FQA(j + 1, Select(V, A, j)) 
      od 

Fig. 3. Computing [q, r]_d using a recursive binary search. The function Select(V, A, j) 
obtains the signatures in V whose j-th coordinate matches A. It may be implemented 
in logarithmic time if V is ordered, which takes no extra memory but incurs an 
O(n log n) penalty for signature ordering. 


2.4 The Fixed Queries Trie 

Since we are using the signatures as strings and want to find matching strings, 
we have a plethora of data structures and algorithms from the string pattern 
matching community. Among them we selected a trie to index the signature 
set U*. We will denote this trie as the Fixed Queries Trie, or FQt, with the lowercase 
t distinguishing the trie from the tree (the Fixed Queries Tree, FQT) described 
in [2]. 

Building the FQt is even faster than building the FQA. The reason is that 
we do not need to sort: inserting each site's signature is done in time proportional 
to the signature length. The construction complexity, measured in distance 
computations, is then O(n) (as for the FQA). Moreover, the side computations 
for the construction are O(n) instead of O(n log n). 

A trie is an m-ary tree with non-data nodes (routing only) and all the data at 
the leaves [3]. Searching for any string takes time proportional to the string size, 
independent of the database size. For our purposes the basic searching algorithm 
is modified to allow multiple string searching, and we make heavy use of the 
lookup table for routing. Our string set will be represented as a lookup table; 
this leads to a slightly different rule for following a node. Instead of following a 
node matching the i-th character of the string (i being the trie level) we will follow 
the node if any coordinate in the set matches the node label. In other words, we 
will follow the node if the lookup table is true for the corresponding node and 
level. 

The number of operations for this search will be proportional to the number 
of successful searches, or the number of leaves visited, times the length of the 
strings. In other words the complexity will be O(|[q, r]_d| t). This represents a 
log(n) factor improvement with respect to the FQA implementation. 

The space requirements of the FQt may be smaller than those of the FQA. The 
number of paths in the FQt is the number of elements in the signature array, 
but the trie will factor out matching coordinates and will use additional memory 
for pointers. The net result is that the trie will use about the same amount of 
memory, is built faster and uses less time for matching queries. 
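
A trie over signature coordinates, searched with lookup tables for routing, can be sketched in Python with nested dictionaries. The sketch is illustrative only: the dictionary-based node layout, the function names and the toy tables are assumptions, and a production trie would use a more compact node representation.

```python
def trie_insert(root, sig, obj):
    """Insert a signature (tuple of coordinates); objects sharing a signature
    end up in the same bucket. Cost is proportional to the signature length."""
    node = root
    for word in sig[:-1]:
        node = node.setdefault(word, {})
    node.setdefault(sig[-1], []).append(obj)


def trie_search(node, tables, level=0):
    """Follow every child whose label is marked True in the lookup table of
    the current level; no binary search is needed."""
    found = []
    last = level == len(tables) - 1
    for word, child in node.items():
        if tables[level][word]:
            if last:
                found.extend(child)          # child is a bucket of objects
            else:
                found.extend(trie_search(child, tables, level + 1))
    return found


# Hypothetical lookup tables for t = 2 coordinates of 4 bits each.
table0 = [False] * 16
table1 = [False] * 16
for i in (4, 5):
    table0[i] = True
for i in (8, 12):
    table1[i] = True

root = {}
for name, sig in [("a", (5, 8)), ("b", (4, 13)), ("c", (6, 12)), ("d", (4, 12))]:
    trie_insert(root, sig, name)
print(sorted(trie_search(root, [table0, table1])))
```

Insertion touches only the path of the new signature, which is consistent with the observation that non-pivot objects can be added without rebuilding the index.
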

3 Experimental Results 

The FQt is exactly equivalent to the FQA and the FQS from the point of view 
of selectivity, internal and external complexity. The same parametric decisions 
can be made using exactly the same analysis. We remark that the only difference 
is in the amount of extra work needed to find the candidate list. In view of this, we 
will only make an experimental validation of the algorithms improved with lookup 
tables, and hence will compare only the side computations; in a single experiment 
we will compare the overall complexity. A more complete experimental study of the 
algorithm will be reported with the final version of this paper. 

The three algorithms were implemented using lookup tables. The algorithms 
are only compared among themselves, since it has been proved that FQAs may beat 
any indexing algorithm just by using more pivots. This is also true for FQt and 
FQS, hence the interest lies in the three surviving alternatives. The previous 
implementations of the FQA (using bit masks) are beaten by the lookup table 
implementation by a large factor. 


Compute [q, r]_d using FQt 

U* is the signature set, which is arranged in a trie. 
{L_i[]} are the lookup tables of the query (q, r)_d. Each node of the trie is labeled 
after a coordinate U 
Compute [q, r]_d 

1. function FQt(node U, i) 
      if i ≥ t then 
         if L_t[U] then 
            [q, r]_d ← Object(U.signature) 
            return 
         fi 
      fi 
      for each U → node do 
         if L_i[U] then 
            FQt(U → node, i + 1) 
         fi 
      od 

Fig. 4. Computing [q, r]_d using a trie. The trie is faster than the FQA because the 
recursion is done without searching. 


We indexed a vocabulary of the English dictionary, obtained from the TREC 
collection of the Wall Street Journal, under the edit distance. This is a known 
difficult example of proximity searching, with high intrinsic dimension, as 
described in [6]. Figure 5 shows the overall time needed to compute a query of 
radius 1 and 2. Each point in the plots, for each experiment, was obtained 
by averaging 200 queries in the database. It can be noticed that as the query 
becomes less selective, the overall time of the algorithms FQA, FQS and FQt 
becomes practically the same. This is explained because the three algorithms 
have the same filtering power and the time to compute the candidate list makes 
a tiny difference. Figure 6 (top) shows the differences between the three filtering 
algorithms, at the left the overall time, and at the right only the time for 
filtering. Figure 6 (bottom) shows the performance of the indexes for medium and 
low selectivity queries (with radius one and two respectively). The time plotted 
is only the filtering time, since the number of distance computations will be the 
same for all of them. The lookup tables are computed for the three algorithms; 
the construction is slightly more costly for fewer bits. 

Fig. 5. The overall time needed to satisfy queries of radius one (left) and two (right). 
The database is a sample of the vocabulary of the TREC WSJ collection, under the edit 
distance. 

Fig. 6. (Top) If we increase the number of pivots the time to obtain the candidate 
list is increased. The FQA and the FQS will use about the same time (about linear) 
for low selectivity queries, while the FQt increases very slowly. (Bottom) The role 
of the number of bits in the filtering time, for a fixed number of pivots and a fixed 
database size. 


4 Final Remarks 

Pivoting algorithms have proved to be very efficient and sound for proximity 
searching. They have two control parameters, the number of pivots and the 
number of bits in the representation. Using lookup tables to implement the 
three alternatives presented in this paper gives a faster filtering stage. We also 
saw a strictly decreasing overall time for a particularly difficult example. The 
experimental evidence and the analysis favor the use of the FQt over 
the FQA or FQS in most realistic environments. Additionally, the FQA and FQS 
do not accept insertions or deletions, unlike the FQt, where insertions or deletions 
(of non-pivot objects) can be carried out trivially. 



Abstract. One of the best prevention measures against breast cancer is the early 
detection of calcifications through mammograms. Detecting calcifications in 
mammograms is a difficult task because of their size and the high content of 
similar patterns in the image. This creates the need for automatic 
tools to determine whether a mammogram presents calcifications or not. In this paper 
we introduce a combination of machine vision and data-mining techniques to 
detect calcifications (including micro-calcifications) in mammograms that 
achieves an accuracy of 92.6% with decision trees and 94.3% with a 
back-propagation neural network. We also focus on the data-mining task with 
decision trees to generate descriptive patterns based on a set of characteristics 
selected by our domain expert. We found that these patterns can be used to 
support the radiologist in confirming a diagnosis or in detecting micro-calcifications 
that could not be seen because of their reduced size. 



1 Introduction 

Breast cancer is the second cause of death in women with cancer, after cervical-uterine 
cancer; this is why breast cancer is considered a public health problem. Statistical data 
from INEGI show that in 2001 breast cancer was the 12th cause of death for Mexican 
women, with 3,574 deaths. 

Early diagnosis of breast cancer is the best-known solution for the problem [5]. 
The degree to which the cancer affects the patient is related to the size of the tumor. 
In the case of small (non-palpable) patterns, we need another reference such as a 
mammogram. With the use of mammograms, the mortality caused by breast cancer has 
decreased by 30% [9]. 

A mammogram study consists of a set of four images, two craniocaudal and two 
lateral images. The radiologist has to look in the images for calcifications and make a 
diagnosis that is consistent with the images. This is a difficult task that only trained 
experts can do with high confidence. Even when an expert makes the diagnosis, there 
are factors that may affect the decision, such as tiredness. One way to make this 
process more objective consists of creating an automatic way to detect calcifications 
in a mammogram to give the radiologist a second opinion. In this way, the automatic 
system may confirm the radiologist's diagnosis or may suggest that a suspicious area 
in the mammogram can be a calcification. The automatic method presented in this 
paper is based on the combination of machine vision and data mining technologies 
and can be used to create an automatic system to provide a second opinion for the 
diagnosis of the radiologist. 

Considerable effort has been devoted to solving this problem. In [11], the authors use a 
Bayesian network to find and classify regions of interest. In that research they use a 
segmentation algorithm based on wavelet transforms and a threshold that 
corresponds to the local minimum of the image. In [1], the authors use data mining 
techniques to detect and classify anomalies in the breast. They use neural networks 
and association rules as the data mining algorithms. They also use a histogram 
equalization process to enhance the image contrast, and then they perform a feature 
extraction process and include those features combined with patient information 
(such as the age of the patient) for the classification task. Their results were 82.248% 
with neural networks and 69.11% with association rules. In [3], the authors present a 
method for feature extraction of lesions in mammograms using an edge-based 
segmentation algorithm. In [8], we find an evaluation of different methods that can be 
used to get texture features from regions of interest (calcifications) extracted from 
mammogram images. In [10], we can see how the wavelet transformation has been 
used to detect groups of micro-calcifications in digital mammograms. In that paper, 
the authors use only the wavelet transformation to detect those groups of 
micro-calcifications, without the help of any other algorithm. 

In the rest of this paper we present our method, starting in section 2 with a brief 
description of the mammograms database that we used for our experiments. In section 
3 we introduce our methodology with all its components. In section 4 we present our 
experiments and results and finally, in section 5, we show our conclusions and future 
work. 



2 Mammograms Database 

For our experiments we are creating a mammograms database in coordination with 
our domain expert, Dr. Nidia Higuero, a radiologist from the ISSSTEP hospital. So 
far we have a set of 84 cases of mammograms (one case per patient); each case 
contains four images, one craniocaudal and one oblique view of each breast. The 
images were digitized with an Epson Expression 1680 FireWire scanner at 400 dpi, 
with a size of 2,500 x 2,500 pixels in BMP format. Of the 84 cases, 54 have 
calcifications and 30 are normal (with no calcifications). Figure 1a shows an example 
of a mammogram of a healthy breast and figure 1b shows a magnified area of a 
mammogram with calcifications (the bright areas in figure 1b correspond to 
calcifications). Detecting a calcification is a difficult task because calcifications can 
be very similar to other areas of the image (this is not the case in figure 1b, where we 
chose a region where calcifications could be easily identified). Our domain expert 
selected the set of cases and gave them to us for scanning. After the images were 
digitized, Dr. Higuero marked the images with calcifications in the places 
where those were found; we will refer to these marked images as positive 
mammograms. We needed these marked images for training purposes, as we will 
mention in the methodology section. 

Fig. 1. Mammogram images. a) Image of a healthy breast. b) Magnified area of a mammogram 
with calcifications 



3 Methodology 

The combination of machine vision and data mining techniques to find calcifications 
in mammograms is shown in figure 2. Figure 2a shows the knowledge discovery in 
databases (KDD) process that we use to find patterns that describe calcifications from 
known images (those images for which we know if they contain calcifications or not). 
The process starts with the image database that consists of the original and marked 
mammograms. As we mentioned before, marked mammograms identify where 
calcifications are located in a positive mammogram. An image preparation process is 
applied to these images to make calcifications easier to detect. After this, a 
segmentation algorithm is applied to each positive image in order to get our positive 
Regions Of Interest (ROIs) corresponding to calcifications. A different segmentation 
algorithm is used to get our negative ROIs, that is, regions of interest of areas that are 
very similar to calcifications but are not. After this step we have a ROIs database 
where each ROI is classified as positive or negative. Next, we apply a feature 
extraction process to each region of interest to create a feature database that will be 
used to train the data mining algorithms. In our case we use a back-propagation neural 
network and a decision tree for the data-mining step. After applying the data-mining 
algorithm, we get patterns to be evaluated in the pattern evaluation step. In the case of 
the decision trees, our patterns are human understandable, as we will show in the data 
mining section, but the patterns found with the neural network are difficult to interpret 
because they are hidden in the neural network weights and architecture. The patterns 
found in the training phase will be used in the diagnosis phase shown in figure 2b. 
The diagnosis starts with the image to be analyzed. We first apply the image 
preparation filter and then use a segmentation algorithm to find our ROIs (in this 
case we do not have a marked image) that correspond to possible calcifications. Once 
we have our ROIs, we execute the feature extraction algorithm to get our feature 
database, where a feature vector represents each ROI. In the next step, we compare 
each ROI (represented by its feature vector) with the patterns found in the training 
phase; if the ROI matches any of the positive patterns we diagnose that ROI as 
positive (a calcification), and as negative otherwise. In the following sections we give 
more detail about each of the steps shown in figure 2. In section 3.1 we describe the 
image processing steps and in section 3.2 we describe the data mining algorithms 
used. 

Fig. 2. KDD Process Applied to Find Calcifications in Mammograms. a) Training Phase. b) 
Diagnosis Phase 



3.1 Image Processing 

In order to find patterns that distinguish a calcification from other tissues, we need to 
get examples of those parts of the image that correspond to calcifications and also 
examples of parts of images that look like calcifications but are not. We call these 
parts of images regions of interest (ROIs). To make ROI identification more 
effective we need to include a noise reduction process, which is achieved with the 
wavelet transformation [10, 11]. Once we have found our ROIs, we extract characteristics 
(or features) that will be used as a representation of the ROIs for the data mining 
algorithms. Section 3.1.1 describes the algorithm used to find regions of interest and 
section 3.1.2 describes the features that are extracted from each ROI. 



Regions of Interest. Finding ROIs in a mammogram is a difficult task because of the 
low contrast with other regions and the differences in types of tissues. ROIs in the 
training phase are found using the segmentation algorithm shown in figure 3. The 
algorithm receives as input the set of original images and the set of marked images. 
As we mentioned in section 2, the set of marked images corresponds to the images 
where our domain expert marked the calcifications that she found. In line 2 of the 
algorithm, we initialize the ROIS variable to be the empty set. In lines 4 and 5 we 
apply the symmlet wavelet transformation to each pair of original and marked images 
in order to stress the places where a calcification might be found and to make a more 
defined background. In lines 6 to 9, we find the position of the area of each marked 
calcification, get the region of interest from the original image for that position, 
and keep storing each ROI in the ROIS variable until we have processed every pair of 
images. Once we have generated the complete set of regions of interest, we are ready for 
the feature extraction process explained in the following section. 

The algorithm used to find negative ROIs, and also to find ROIs in new images, 
differs from the algorithm in figure 3 because there are no marked images for those 
cases. As shown in figure 4, the testing segmentation algorithm starts by applying a 
wavelet filter to the original image (see line 2 of figure 4), as we described for the 
training segmentation algorithm. In line 4, we use a global threshold segmentation 
1.  TrainingSegmentation(SetOfOriginalImages, SetOfMarkedImages) 
2.    ROIS = ∅ 
3.    For each pair of images: OriginalImage, MarkedImage 
4.      WOriginalImage = SymsWavelet(OriginalImage) 
5.      WMarkedImage = SymsWavelet(MarkedImage) 
6.      For each calcification mark in WMarkedImage 
7.        Position = LocateCalcificationArea(WMarkedImage) 
8.        ROIS = ROIS ∪ GetCalcificationArea(WOriginalImage, Position) 
9.      EndFor 
10.   EndFor 
11. End Segmentation 

Fig. 3. Training Segmentation Algorithm 

1.  TestingSegmentation(OriginalImage, MinSize, MaxSize) 
2.    WOriginalImage = SymsWavelet(OriginalImage) 
3.    ROIS = ∅ 
4.    ROIS = GlobalThresholdSegmentation(WOriginalImage, Threshold) 
5.    For each ROI in ROIS 
6.      If Area(ROI) < MinSize 
7.        ROIS = ROIS - ROI 
8.      EndIf 
9.      If Area(ROI) > MaxSize 
10.       ROIS = ROIS - ROI 
11.       ROISInNodule = EdgeSegmentation(ROI) 
12.       ROIS = ROIS + ROISInNodule 
13.     EndIf 
14.   EndFor 
15.   For each ROI in ROIS 
16.     LocalThresholdSegmentation(ROI) 
17.     FinalROIS = FinalROIS + ROI 
18.   EndFor 
19.   Return FinalROIS 

Fig. 4. Testing Segmentation Algorithm 

algorithm to find ROIs. After this, we eliminate those small ROIs that, because of 
their size, cannot be a calcification, not even a micro-calcification (lines 5 to 8). We 
keep those ROIs whose size is within the range of sizes of the ROIs in the 
training set. Next, we apply a local edge segmentation algorithm to those ROIs that 
are too large to be considered a calcification but that could contain one or more 
calcifications inside. These regions are called nodules, and we also keep the ROIs 
found inside a nodule as possible calcifications (lines 9 to 13 in figure 4). Then, in 
lines 15 to 18, we take all ROIs that might be calcifications and apply to them a local 
threshold segmentation algorithm to eliminate small imperfections in the image, called 
artifacts, and to improve the quality of the edges of the calcifications. Finally, we 
perform the feature extraction task over these ROIs to create a feature database of 
negative ROIs for the training case or a feature database for the diagnosis case. 
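
The size-based part of this procedure can be sketched in Python as follows. This is only an illustration under stated assumptions, not the implementation used in the paper: the function names, the list-of-rows image format and the use of 4-connectivity to group pixels are all hypothetical, and the wavelet filter, the edge segmentation and the local threshold steps are omitted.

```python
from collections import deque


def threshold_regions(image, threshold):
    """Label 4-connected regions of pixels brighter than `threshold`.

    `image` is a list of rows of gray levels (an assumed format). Returns a
    list of regions, each a list of (row, col) pixel coordinates.
    """
    rows, cols = len(image), len(image[0])
    seen = [[False] * cols for _ in range(rows)]
    regions = []
    for r in range(rows):
        for c in range(cols):
            if seen[r][c] or image[r][c] <= threshold:
                continue
            # Breadth-first flood fill collects one connected region.
            region, queue = [], deque([(r, c)])
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                region.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and not seen[ny][nx] and image[ny][nx] > threshold):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            regions.append(region)
    return regions


def filter_by_size(regions, min_size, max_size):
    """Split regions into calcification candidates and oversized nodules.

    Regions smaller than `min_size` are dropped (lines 5 to 8 of figure 4).
    Regions larger than `max_size` are returned separately so that an edge
    segmentation step can search inside them (lines 9 to 13 of figure 4).
    """
    candidates = [reg for reg in regions if min_size <= len(reg) <= max_size]
    nodules = [reg for reg in regions if len(reg) > max_size]
    return candidates, nodules
```

Here the area of a region is simply its pixel count; any other area measure could be substituted without changing the structure of the filter.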



Feature Extraction Algorithm. The feature extraction algorithm shown in figure 5 
receives as input the set of ROIs found with the segmentation algorithms shown in 
figures 3 and 4. This algorithm processes each region of interest to extract the 
characteristics: area, diameter, density, convexity, internal ellipse radius, external 
1. FeatureExtraction(ROIS) 
2.   FeatureVector[|ROIS|] = [∅] 
3.   For i = 1 to |ROIS| 
4.     FeatureVector[i] = GetFeatures(ROIS[i]) 
5.   EndFor 
6. End FeatureExtraction 

Fig. 5. Feature Extraction Algorithm 

ellipse radius, orientation, circularity, eccentricity, roundness, and contour length. We 
chose these characteristics with the help of our domain expert, trying to capture what 
she considers when she makes a diagnosis. We experimented with texture features [2] 
but we got better results with the features recommended by our domain expert. All the 
features extracted from each ROI are stored in a feature vector database that is 
used to feed our data mining algorithms in the next phase of the process. 
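
As an illustration of what a GetFeatures step might compute, the Python sketch below derives three of the listed features from a region given as a set of pixels. The paper does not give the formulas it uses, so the definitions here (pixel count for area, boundary pixel count for contour length, and 4πA/P² for circularity) are common textbook choices and should be read as assumptions; the function name is hypothetical.

```python
import math


def shape_features(region):
    """Return area, contour length and circularity for one ROI.

    `region` is a collection of (row, col) pixels. The definitions are
    assumptions made for illustration, not taken from the paper.
    """
    pixels = set(region)
    area = len(pixels)
    # A pixel is on the contour if any of its 4-neighbors lies outside the region.
    contour = sum(
        1 for (y, x) in pixels
        if any(n not in pixels
               for n in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)))
    )
    # 4*pi*A / P^2 equals 1 for a disc in the continuous case; with pixel
    # counts it is only a rough indicator of how compact the region is.
    circularity = 4 * math.pi * area / contour ** 2 if contour else 0.0
    return {"area": area, "contour_length": contour, "circularity": circularity}
```

A full feature vector would concatenate the eleven characteristics in a fixed order so that every ROI is described by a vector of the same length.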



3.2 Data Mining 

Data mining is the task of finding interesting, useful, and novel patterns from 
databases [4]. In our case we want to find patterns that describe calcifications in 
mammograms so that we can use them to predict whether a new mammogram has 
calcifications or not. As we mentioned before, we use a backpropagation neural 
network and a decision tree as our data mining algorithms. Neural networks have the 
property of achieving high accuracies for the classification task but what they learn is 
not easy to understand. On the other hand, decision trees are known to achieve high 
accuracies in the classification task and are also easy to understand. 



Neural Networks. For a long time researchers have tried to simulate how the human 
brain works with mathematical models called neural networks. A neural network is 
composed of a set of cells that are interconnected in layers. The first layer is 
called the input layer and its function is to pass the input signals to the next layer. 
Each cell in the next layer (intermediate cells) receives a signal from each cell in the 
previous layer, modified by a weight factor. The intermediate cells calculate their 
output signal according to a function over their input signals; in the case of the 
backpropagation algorithm [7], the output signal is calculated with the sigmoid 
function and then passed to the cells of the next layer. Once the signals reach the 
output cells, the result is compared with the real classification of the input feature 
vector and the error is propagated to the previous layers by adjusting the weights of 
each connection between cells. In our experiments, we use a Neural Network (NN) with 
11 input nodes, 1 hidden layer with 5 nodes and two output nodes. As we can see in 
figure 6, the input nodes receive the feature vector of each ROI and the output nodes 
show the classification of the NN for the given input vector. The possible 
classifications are positive (the feature vector corresponds to a calcification) or 
negative (the feature vector does not correspond to a calcification). The NN is trained 
with the backpropagation algorithm with a learning rate of 0.01 for 500 epochs, 
comparing the actual output of the network with the real output for each ROI and 
changing the weights of the network accordingly. 
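
The following Python sketch shows one backpropagation update for a network of the size described above (11 inputs, 5 hidden cells, 2 outputs, sigmoid activations). It is a minimal illustration and not the authors' implementation: the bias weights, the uniform weight initialization, the squared-error criterion and all function names are assumptions.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def init_layer(n_in, n_out, rng):
    # One weight row per cell; the last weight in each row acts as a bias
    # (the bias term is an assumption, the paper does not mention it).
    return [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_out)]


def forward(layer, inputs):
    # `inputs` is a list; a constant 1.0 is appended for the bias weight.
    return [sigmoid(sum(w * v for w, v in zip(row, inputs + [1.0]))) for row in layer]


def train_step(hidden, output, x, target, rate=0.01):
    """Apply one backpropagation update for feature vector `x`.

    Returns the squared error measured before the weights were changed.
    """
    h = forward(hidden, x)
    y = forward(output, h)
    # Output deltas: error times the derivative of the sigmoid.
    d_out = [(t - o) * o * (1.0 - o) for t, o in zip(target, y)]
    # Hidden deltas: output deltas propagated back through the output weights.
    d_hid = [hj * (1.0 - hj) * sum(d_out[k] * output[k][j] for k in range(len(output)))
             for j, hj in enumerate(h)]
    for k, row in enumerate(output):
        for j, v in enumerate(h + [1.0]):
            row[j] += rate * d_out[k] * v
    for j, row in enumerate(hidden):
        for i, v in enumerate(x + [1.0]):
            row[i] += rate * d_hid[j] * v
    return sum((t - o) ** 2 for t, o in zip(target, y))


rng = random.Random(0)
hidden = init_layer(11, 5, rng)   # 11 input features -> 5 hidden cells
output = init_layer(5, 2, rng)    # 5 hidden cells -> 2 output cells
```

Training for one epoch would call `train_step` once per ROI, with a target of `[1.0, 0.0]` for a positive example and `[0.0, 1.0]` for a negative one (this encoding is also an assumption).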
Fig. 6. Neural Network Architecture 




Fig. 7. Partial Decision Tree 



Decision Trees. Decision trees are a classification method that generates a tree to 
classify a set of input examples according to their class [6]. Each branch in the tree 
represents a decision. Each node in the tree refers to a particular attribute. Edges 
connecting nodes are labeled with attribute values and leaf nodes give a 
classification that applies to the examples that were reached through that branch. At 
each step of the tree construction a node is selected according to a statistical measure 
called information gain, which measures how well a node (attribute) separates the input 
examples with respect to their class. Figure 7 shows part of an example of a decision 
tree for our domain. As we can see, the root node of the tree is the area node. If area 
has a value less than or equal to 13, we check the value of the diameter attribute. If 
diameter has a value less than or equal to 2, then the class of the example is negative. If 
Table 1. 10-Fold Cross Validation Results 

Algorithm          10-Fold CV Result    Standard Deviation 
Neural Networks    94.3 %               1.7127 
Decision Trees     92.6 %               0.2662 



1) If area <= 13, and diameter <= 2, then calcification = negative 

2) If roundness > 0.68, and contour length > 10.49, then calcification = positive 

3) If density > 6.60, then calcification = negative 

4) If external ellipse radius <= 1.19, then calcification = positive 

Fig. 8. Decision Rules from the Calcification Detection Domain 

the value of diameter is greater than 2, we check the diameter attribute again, 
and if it is less than or equal to 2.83, we check the convexity attribute. If the 
convexity attribute has a value less than or equal to 0.93, the class is negative; 
otherwise the class is positive. For our experiments with decision trees we used the 
C4.5 algorithm described in [6]. 
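
To make the information gain measure concrete, the Python sketch below computes it for a binary split of a numeric attribute at a threshold, as in tests such as "area <= 13". The function names are hypothetical, and the sketch covers only the plain information gain described above; C4.5 itself normalizes this quantity into a gain ratio, which is not shown.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(values, labels, threshold):
    """Gain obtained by splitting the examples at `value <= threshold`."""
    left = [lab for v, lab in zip(values, labels) if v <= threshold]
    right = [lab for v, lab in zip(values, labels) if v > threshold]
    if not left or not right:
        return 0.0  # the split separates nothing
    remainder = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
    return entropy(labels) - remainder
```

A tree builder would evaluate this quantity for every candidate attribute and threshold and place the best-scoring test at the current node.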

In the following section we will show our experimental results using the machine 
vision and data mining algorithms for the task of calcification detection in 
mammograms. 



4 Experiments and Results 

For our experiments we follow the process described in figure 2. We find positive and 
negative regions of interest from 70 craniocaudal mammogram images with 
calcifications and 60 mammogram images with no calcifications. We mentioned 
before that we had 54 cases with calcifications, but this does not mean that both 
craniocaudal images of a case have calcifications; this is why we only have 70 
mammogram images with calcifications. From these images, we obtained a total of 
653 ROIs, of which 326 are positive (calcifications) and 327 are negative 
(non-calcifications). We applied the feature extraction process to these ROIs and 
trained our data mining algorithms with them. We used the 10-fold cross validation 
technique (that is, we used 90% of the examples for training and the remaining 10% 
for testing in each of the 10 trials) to evaluate the performance of the algorithms and 
obtained the results shown in table 1. 
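
The splitting scheme behind 10-fold cross validation can be sketched as below. The function name, the shuffling and the fixed seed are assumptions made for illustration; the paper does not describe how the folds were drawn.

```python
import random


def k_fold_indices(n_examples, k=10, seed=0):
    """Yield (train, test) index lists for k-fold cross validation."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index starting at position i.
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test
```

With the 653 ROIs used here, each test fold holds 65 or 66 examples and the remaining examples form the training set; the reported accuracy is the average over the 10 trials.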

As we can see in table 1, we obtained better accuracy with the neural network 
than with the decision tree, but the standard deviation of the neural network is also 
higher than that of the decision tree. The main drawback of neural networks is that 
they are not easy to understand, and we needed to show the learned patterns to our 
domain expert. This is why we generated rules from the decision tree and asked our 
domain expert to study them. Dr. Higuero found the rules very interesting and told us 
that she could relate them to the way she makes her diagnosis. Figure 8 shows some 
examples of these rules. The first rule says that if the area of the ROI is less than 
or equal to 13, and its diameter is less than or equal to 2, then it might not be a 
calcification. Rule 2 says that if the roundness of the ROI is greater than 0.68, and 
its contour length is greater than 10.49, then it might be a calcification. 
Rule 3 states that if the density of the ROI is greater than 6.60, it might not be a 
calcification. Finally, rule 4 says that if the external ellipse radius of a ROI is less 
than or equal to 1.19, then it might be a calcification. In a different experiment with 
Dr. Higuero we discovered that the system could detect very small microcalcifications 
that she could miss because of their reduced size. 



5 Conclusion 

The experimental results show that our method combining machine vision and data 
mining techniques for the task of detecting calcifications in mammograms has been 
successful, achieving an accuracy of 94.3% with a neural network trained with the 
backpropagation algorithm and an accuracy of 92.6% with decision trees. The 
accuracy was calculated using the 10-fold cross validation technique. Our domain 
expert also told us that the accuracy achieved was good enough to implement our 
method as a Computer-Aided Diagnosis (CAD) system that gives a second opinion to 
the radiologist. She also told us that our method had found calcifications that she was 
not able to see because of their reduced size (micro-calcifications). Our next step is to 
implement the CAD system using the process discussed in figure 2b, which will 
allow us to compare our method’s accuracy with the physician’s accuracy for the 
prediction of calcifications in mammograms. We also want to try other algorithms 
such as Bayesian networks to see if we can improve the efficiency achieved with the 
backpropagation and decision tree algorithms. 
Abstract. We present an optimization algorithm that combines active 
learning and locally-weighted regression to find extreme points of noisy 
and complex functions. We apply our algorithm to the problem of 
interferogram analysis, an important problem in optical engineering 
that is not solvable using traditional optimization schemes and that 
has received recent attention in the research community. Experimental 
results show that our method is faster than others previously presented 
in the literature and that it is very accurate for the case of noiseless 
interferograms, as well as for the case of interferograms with two types 
of noise: white noise and intensity gradients, which are due to slight 
misalignments in the system. 

Keywords: Optimization, active learning, instance-based learning, 
locally-weighted regression 



1 Introduction 

Optimization in poorly modelled, noisy and complex domains is an important 
problem that is faced in many scientific and engineering areas. In such do- 
mains, traditional optimization algorithms, such as the Simplex method [1] or 
the Levenberg-Marquardt algorithm [2], do not yield satisfactory results and 
are usually very sensitive to the starting search points provided by the user. 
For these reasons, non-traditional optimization algorithms, including simulated 
annealing [3], genetic algorithms [4,5], evolution strategies [6,7] and hybrid 
evolutionary-classical algorithms [8], have been proposed. While good results 
have been reported, the running times of these algorithms are often high, and 
they are not well suited to all domains. Thus, more efficient and complementary 
algorithms are desirable. 

In this paper we propose an efficient algorithm to perform optimization in 
complex domains. Our algorithm is based on the observation that the candidate 
solutions generated by an optimization algorithm, which are normally discarded 
by both traditional and non-traditional schemes, can be used as a training set 
for a learning algorithm, which in turn can predict the parameters of an optimal 
solution to the problem. An advantage of this approach is that, if we want to 
find the solutions to several similar problems, we can process them concurrently 
and integrate all the candidate solutions in a single training set. Since the 
training set is continuously changed, we need a learning algorithm that requires a 
small training time. Instance-based learning algorithms, whose training consists 
of simply storing the training data, fulfill this requirement. In our work we used 
the locally-weighted regression algorithm [9], an instance-based learning 
algorithm that has been found to yield accuracy similar to that of neural networks 
in many application domains while preserving the short training times inherent to 
this class of methods. 

We illustrate our method with an application to the problem of interferogram 
analysis, which has received recent attention in the literature. This is an inter- 
esting domain, as there are published attempted solutions using both traditional 
optimization schemes and evolutionary algorithms [10,11]. 

The organization of the remainder of this paper is as follows. In Section 
2 we describe the proposed optimization algorithm, the main contribution of 
this paper. Section 3 presents background material about interferometry, the 
application area we use to illustrate the algorithm. Section 4 gives details about 
the adaptation of the algorithm to the problem. Section 5 shows the main results, 
and Section 6 presents conclusions and suggests directions for future work. 

2 Outline of the Optimization Algorithm 

We are interested in the problem of finding the parameters of a known analytic 
function that best match an observation. Let $o$ be the observed (multidimensional) 
variable, and let $f(x)$ be a function with the same dimensionality as $o$. The 
goal of the optimization procedure is to obtain the value of $x$ that minimizes 
$|o - f(x)|$. Typically, this problem is solved by an iterative process: in iteration 
$i$ we generate $x_i$, evaluate the target function $|o - f(x_i)|$ and, based on the 
value of the target function as well as its first and second derivatives (if they are 
available), we generate the next candidate value $x_{i+1}$, which is expected to be 
closer to the optimum. 

This work deals with the problem where we have several observations 
$o_1, \ldots, o_n$, and we want to find the vectors $x_1, \ldots, x_n$ that minimize 
the errors $e_i = |o_i - f(x_i)|$. Clearly, this can be done by solving the $n$ 
optimization problems separately. However, we propose a method to solve the problem 
more efficiently, posing it as a learning problem, where a learning algorithm learns 
the inverse function $f^{-1}$. The training set used by the algorithm is formed by 
the pairs of values $(f(x_i), x_i)$ previously generated in the search, its test set 
consists of the values $o_1, \ldots, o_n$ and it outputs an estimate of 
$x_1, \ldots, x_n$ that is expected to minimize $e_1, \ldots, e_n$. When a new set of 
solutions $x_1, \ldots, x_n$ is proposed by the algorithm, we compute their 
corresponding $f(x_1), \ldots, f(x_n)$ and use the new pairs $(f(x_i), x_i)$ to 
augment the training set, and continue this iterative process until convergence is 
attained. Since this type of active learning adds to the training set examples that 
are progressively closer to the points of interest, the errors are guaranteed to 
decrease in every iteration. The outline of the algorithm can be described by the 
following pseudocode: 
1. Generate randomly an initial set of values x_1, ..., x_m and compute their 
   corresponding f(x_1), ..., f(x_m). 
2. Let R = {(f(x_1), x_1), ..., (f(x_m), x_m)} be the initial training set. 
3. Let T = {o_1, ..., o_n} be the test set. 
4. While T is not empty 
   a) Train an approximator A using R as training set 
   b) For each o_i ∈ T 
      i.   Generate A(o_i) 
      ii.  R = R ∪ {(f(A(o_i)), A(o_i))} 
      iii. If |o_i - f(A(o_i))| < threshold, remove o_i from T. 

Here, $A(o_i)$ can be seen as the best current guess for the value of $x$ such 
that $f(x) = o_i$. For the algorithm to be efficient, we need to minimize the time 
taken to train $A$. This can easily be done if we use an instance-based learning 
algorithm, such as the locally-weighted regression algorithm (explained in the 
next subsection). The other potentially time-consuming step is the application 
of $A$ to compute $A(o_i)$, as it normally takes time proportional to the size of 
the training set to find the nearest neighbors of each example in the test set. 
However, since the algorithm is applied repeatedly to the same test set, we can 
cache the nearest neighbors of each example in the test set, and every time the 
training set is augmented (step 4.b.ii) we can check if the example added to the 
training set becomes a nearest neighbor of any of them. 
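
The loop in steps 1 to 4 can be sketched in Python as follows. This is an illustration under several assumptions rather than the authors' code: the function and parameter names are hypothetical, a `max_iters` guard is added so the sketch always terminates, the nearest-neighbor cache is omitted, and the approximator used in the demonstration (`bracket_fit`) is a simple one-dimensional interpolator that stands in for locally-weighted regression.

```python
def active_optimize(f, observations, initial_xs, fit, distance, threshold,
                    max_iters=100):
    """Iteratively learn an inverse of `f` on the points that matter.

    Returns a dict mapping each observation index to (error, x) for the
    best candidate found. `fit(training)` must return a function that
    maps an observation to a guess for x.
    """
    training = [(f(x), x) for x in initial_xs]   # R: pairs (f(x), x)
    pending = dict(enumerate(observations))      # T: observations not yet solved
    best = {}
    for _ in range(max_iters):                   # guard added for this sketch
        if not pending:
            break
        predict = fit(training)                  # step 4.a
        for i, o in list(pending.items()):
            x = predict(o)                       # step 4.b.i
            fx = f(x)
            training.append((fx, x))             # step 4.b.ii
            err = distance(o, fx)
            if i not in best or err < best[i][0]:
                best[i] = (err, x)
            if err < threshold:                  # step 4.b.iii
                del pending[i]
    return best


def bracket_fit(training):
    """Stand-in approximator for scalar problems (an assumption).

    Interpolates linearly between the stored pairs whose outputs
    bracket the query.
    """
    pairs = sorted(training)

    def predict(o):
        below = [p for p in pairs if p[0] <= o]
        above = [p for p in pairs if p[0] > o]
        if not below or not above:
            # Query outside the stored range: fall back to the nearest pair.
            return min(pairs, key=lambda p: abs(p[0] - o))[1]
        (y0, x0), (y1, x1) = below[-1], above[0]
        return x0 + (o - y0) * (x1 - x0) / (y1 - y0)

    return predict
```

For example, calling `active_optimize(lambda x: x ** 3, [8.0, 0.125], [0.0, 1.0, 3.0], bracket_fit, lambda a, b: abs(a - b), 1e-6)` searches for the cube roots of both observations while sharing a single training set between them.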

2.1 Locally-Weighted Regression 

Locally-Weighted Regression (LWR) belongs to the family of instance-based 
learning algorithms. In contrast to most other learning algorithms, which use 
their training examples to construct explicit global representations of the 
target function, instance-based learning algorithms simply store some or all of the 
training examples and postpone any generalization effort until a new instance 
must be classified. They can thus build query-specific local models, which 
attempt to fit the training examples only in a region around the query point. In 
this work we use a linear model around the query point to approximate the 
target function. 

Given a query point $x_q$, to predict its output parameters $y_q$, we find the $k$ 
examples in the training set that are closest to it, and assign to each of them 
a weight given by the inverse of its distance to the query point: 
$w_i = \frac{1}{|x_q - x_i|}$. Let $W$, the weight matrix, be a diagonal matrix with 
entries $w_1, \ldots, w_k$. Let $X$ be a matrix whose rows are the vectors 
$x_1, \ldots, x_k$, the input parameters of the examples in the training set that 
are closest to $x_q$, with the addition of a “1” in the last column. Let $Y$ be a 
matrix whose rows are the vectors $y_1, \ldots, y_k$, the output parameters of 
these examples. Then the weighted training data are given by $Z = WX$ and the 
weighted target function is $R = WY$. Then we use the estimator for the target 
function $y_q = x_q^T (Z^T Z)^{-1} Z^T R$. 

Thus, locally weighted linear regression is very similar to least-squares linear 
regression, except that the error terms used to derive the best linear approxima- 
tion are weighted by the inverse of their distance to the query point. Intuitively, 
this yields much more accurate results than standard linear regression because 
the assumption that the target function is linear does not hold in general, but 
is a very good approximation when only a small neighborhood is considered. 
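
A minimal pure-Python sketch of this estimator for a scalar output is given below; a vector output can be handled by applying it to each output component in turn. The function names are hypothetical, the exact-match shortcut (needed because the weight is undefined at zero distance) is an assumption, and the small linear solver is included only to keep the example self-contained.

```python
import math


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(a)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def lwr_predict(xs, ys, query, k):
    """Predict a scalar output for `query` from its k nearest examples."""
    nearest = sorted(zip(xs, ys), key=lambda p: math.dist(p[0], query))[:k]
    z_rows, r_vals = [], []
    for x, y in nearest:
        d = math.dist(x, query)
        if d == 0.0:
            return y                                 # exact match in the training set
        w = 1.0 / d                                  # inverse-distance weight
        z_rows.append([w * v for v in x] + [w])      # row of Z = W X, with the "1" column
        r_vals.append(w * y)                         # entry of R = W Y
    dim = len(query) + 1
    ztz = [[sum(row[i] * row[j] for row in z_rows) for j in range(dim)]
           for i in range(dim)]
    ztr = [sum(row[i] * r for row, r in zip(z_rows, r_vals)) for i in range(dim)]
    beta = solve(ztz, ztr)                           # (Z^T Z)^{-1} Z^T R
    return sum(b * v for b, v in zip(beta, list(query) + [1.0]))
```

The sketch assumes that the k neighbors are in general position, so that the matrix Z^T Z is invertible; with too few or degenerate neighbors the solver would fail.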

3 Interferometry 

Interferometry is a laboratory technique very commonly used to test the quality 
of optical systems. To perform interferometry, two beams, one passing through 
a reference surface and the other passing through the test surface, are combined 
and made to interfere, which results in a pattern, called an interferogram, that 
characterizes the quality of the test surface. A schematic diagram of a simple 
interferometer is shown in figure 1. Experienced technicians can diagnose the 
flaws of the test surface by careful analysis of the interferogram. However, this 
is a time-consuming task, and when there is a need to analyze more than a few 
interferograms, it becomes impractical. Thus, there is a need for techniques to 
automate this process. 

The problem of automatically characterizing an interferogram has received 
recent attention in the literature. This is a difficult problem, and traditional 
optimization schemes based on the least-squares method often provide inconclusive 
results, especially in the presence of noisy data [12,13]. For this reason, 
non-traditional optimization schemes, such as evolutionary algorithms, have been 
proposed to solve this problem [11]. While evolutionary algorithms provide very 
accurate results in the case of both noiseless and noisy data, their running time is 
high, taking several minutes to analyze a single interferogram. Clearly, if we need 
to analyze a large number of interferograms, this approach becomes infeasible. 



[Figure: schematic of an interferometer; the only legible label is "Reference Mirror"] 

Fig. 1. A simple interferometer 
3.1 Interferogram Simulation 

To obtain the simulated interferograms we use Kingslake’s formulation [14], 
where the intensity in the interferometric image is given by 

$$I(x,y) = \cos\left(\frac{2\pi}{\lambda} W(x,y)\right) + G(x,y) + N(x,y) \qquad (1)$$

where $\lambda$ is the wavelength of the light source, $N$ is white noise, which we 
represent as a random number drawn from a Gaussian distribution with zero 
mean, and $G$ is a noise term due to slight misalignments in the system, which 
appears in the image as an intensity gradient and can be defined by three parameters 
$\gamma_1$, $\gamma_2$ and $\gamma_3$: 

$$G(x,y) = \gamma_1 + \gamma_2 x + \gamma_3 y \qquad (2)$$

$W(x,y)$ is the optical path difference (OPD) between the reference and test 
surfaces, and it is represented by a polynomial using the Seidel aberration 
formulation: 

$$W(x,y) = A(x^2 + y^2)^2 + By(x^2 + y^2) + C(x^2 + 3y^2) + D(x^2 + y^2) + Ey + Fx \qquad (3)$$

where $A$ is the spherical aberration coefficient, $B$, $C$ and $D$ are the coma, 
astigmatism and defocusing coefficients, respectively, $E$ is the tilt 
about the $y$ axis, and $F$ is the tilt about the $x$ axis. 

Clearly, it is easy to obtain an interferogram $I$ given the vector of 
aberration coefficients $v = [A, B, C, D, E, F]$ and the vector of intensity gradients 
$\gamma = [\gamma_1, \gamma_2, \gamma_3]$. However, we are interested in the inverse 
problem, that is, obtaining the vector of aberration coefficients and intensity 
gradient from the corresponding interferogram, which, as mentioned before, is a very 
difficult optimization problem. The following section will describe our proposed 
solution to this problem. 
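
The forward simulation described by equations 1 to 3 can be sketched in Python as shown below. Several details are assumptions made for illustration and are not stated in the text: the function names, the normalization of the pixel coordinates to the range [-1, 1], the default wavelength of 1 (that is, coefficients expressed in wavelengths), and the `noise_std` parameter controlling the Gaussian noise. The default size of 81 matches the 81 by 81 resolution used in the experiments.

```python
import math
import random


def opd(x, y, coeffs):
    """Seidel optical path difference W(x, y) of equation 3."""
    a, b, c, d, e, f = coeffs
    r2 = x * x + y * y
    return (a * r2 ** 2 + b * y * r2 + c * (x * x + 3 * y * y)
            + d * r2 + e * y + f * x)


def simulate_interferogram(coeffs, gradient=(0.0, 0.0, 0.0), noise_std=0.0,
                           size=81, wavelength=1.0, seed=None):
    """Return a size x size intensity image following equations 1 to 3."""
    rng = random.Random(seed)
    g1, g2, g3 = gradient
    image = []
    for row in range(size):
        y = -1.0 + 2.0 * row / (size - 1)      # assumed coordinates in [-1, 1]
        line = []
        for col in range(size):
            x = -1.0 + 2.0 * col / (size - 1)
            fringe = math.cos(2.0 * math.pi / wavelength * opd(x, y, coeffs))
            noise = rng.gauss(0.0, noise_std) if noise_std > 0 else 0.0
            # Equation 1: fringe term plus intensity gradient plus white noise.
            line.append(fringe + g1 + g2 * x + g3 * y + noise)
        image.append(line)
    return image
```

Flattening the returned image row by row gives the high-dimensional vector that plays the role of f(x) in the optimization algorithm of Section 2.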



4 Automated Interferogram Analysis 

The problem of finding the parameters that characterize an optical system is 
known as interferogram analysis. In this section we show the application of our 
proposed optimization method to the problem of interferogram analysis. 



4.1 Preprocessing 

The input to our system is a set of interferograms and the output is a vector of 
aberration and gradient coefficients that characterize them. Before we apply the 
optimization algorithm, we can greatly reduce the dimensionality of the learning 
task using a principal component analysis preprocessing stage to compress the 
high-dimensional interferograms into a more manageable size with minimal loss 
of information. 
Principal Component Analysis. The formulation of standard PCA is as 
follows. Consider a set of $m$ vectors $v_1, v_2, \ldots, v_m$, where the mean object 
of the set is defined by 

$$\mu = \frac{1}{m} \sum_{i=1}^{m} v_i \qquad (4)$$

Each object differs from the mean by the vector 

$$\theta_i = v_i - \mu \qquad (5)$$

Let $A = [\theta_1, \theta_2, \ldots, \theta_m]$. $C$, the covariance matrix, is 
given by 

$$C = \sum_{i=1}^{m} \theta_i \theta_i^T = A A^T \qquad (6)$$

The principal components are then the eigenvectors of $C$. If we sort these 
eigenvectors by decreasing order of their corresponding eigenvalues, a projection 
onto the space defined by the first $k$ eigenvectors ($1 \le k \le m$) is optimal with 
respect to information loss. That is, let $P$ be the matrix whose columns are the 
first $k$ eigenvectors of $C$; then the optimal projection of $v_i$ is given by 

$$p_i = P^T (v_i - \mu) \qquad (7)$$
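
As a small illustration of these steps, the Python sketch below computes the mean, the matrix of summed outer products of the centered vectors, and its leading eigenvector, and then projects a vector onto that single component. It is a simplification made for this example: only the first principal component is extracted, power iteration is used in place of a full eigendecomposition, and the function names are hypothetical. Any constant scaling of the matrix leaves its eigenvectors unchanged, so the result does not depend on how the covariance is normalized.

```python
def pca_first_component(vectors, iters=200):
    """Return (mean, unit eigenvector with the largest eigenvalue)."""
    m, dim = len(vectors), len(vectors[0])
    mean = [sum(v[j] for v in vectors) / m for j in range(dim)]
    thetas = [[v[j] - mean[j] for j in range(dim)] for v in vectors]
    # Sum of the outer products theta_i theta_i^T of the centered vectors.
    cov = [[sum(t[a] * t[b] for t in thetas) for b in range(dim)]
           for a in range(dim)]
    # Power iteration; it assumes the start vector is not orthogonal
    # to the leading eigenvector.
    vec = [1.0] * dim
    for _ in range(iters):
        nxt = [sum(cov[a][b] * vec[b] for b in range(dim)) for a in range(dim)]
        norm = sum(c * c for c in nxt) ** 0.5
        if norm == 0.0:
            break
        vec = [c / norm for c in nxt]
    return mean, vec


def project(v, mean, component):
    """Coordinate of `v` along one principal component."""
    return sum(c * (x - mu) for c, x, mu in zip(component, v, mean))
```

In the setting of this paper the vectors would be flattened interferograms, and the first k eigenvectors would be kept rather than only one.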



4.2 Optimization Algorithm 

The algorithm to perform automated interferogram analysis can be described 
as follows. First we randomly generate $k$ parameter vectors $x_1, \ldots, x_k$, where 
$x_i = [A_i, B_i, C_i, D_i, E_i, F_i, \gamma_{1i}, \gamma_{2i}, \gamma_{3i}]$ contains 
aberration and gradient coefficients. For each $x_i$ we construct the corresponding 
interferogram $f(x_i)$ applying equations 1, 2 and 3. Then we perform PCA on a matrix 
$J = [f(x_1), \ldots, f(x_k)]$, obtaining $P$, the matrix of principal components, 
and $\mu$, the mean vector, as described in 4.1. The projection of each interferogram 
into the eigenspace is then given by $p_i = P^T (f(x_i) - \mu)$. We can now use the 
set of pairs $R = \{(p_i, x_i)\}$ as the initial training set for the algorithm. 

Given a set of test interferograms $T_1, \ldots, T_m$, we first project them to the 
eigenspace created in the previous step, $t_i = P^T (T_i - \mu)$, and we give these 
projections as the test set to the learning algorithm described in Section 2. 

5 Experimental Results 

In this section we describe the experiments performed with our optimization 
algorithm applied to the problem of predicting the vectors of aberration coeffi- 
cients and intensity gradients. First, we generated a thousand aberration vectors, 
a thousand intensity gradient vectors and their corresponding interferograms, 
using an 81 by 81 resolution. For this experiment we dealt with noiseless inter- 
ferograms (that is, the noise terms G and N were set to zero). Using principal 
Table 1. Mean Absolute Errors for Simulated Interferograms 

Coefficient    Mean Absolute Error    Standard Deviation 
A              0.0011                 0.0010 
B              0.0015                 0.0011 
C              0.0006                 0.0005 
D              0.0010                 0.0009 
E              0.0002                 0.0002 
F              0.0004                 0.0005 



Table 2. Mean Absolute Errors for Noisy Simulated Interferograms 

Coefficient    Mean Absolute Error    Standard Deviation 
A              0.1109                 0.1525 
B              0.0887                 0.0829 
C              0.0398                 0.0697 
D              0.0591                 0.0506 
E              0.0250                 0.0526 
F              0.0463                 0.0800 
γ1             0.0080                 0.0023 
γ2             0.0100                 0.0024 
γ3             0.0030                 0.0010 



component analysis we reduced the dimensionality of the task, keeping 47 
eigenvectors, which preserve about 95% of the information in the original data. Then 
we randomly divided the data into ten equally sized subgroups; one group was 
used for testing and the remaining nine were considered the training set. Ten 
different experiments were performed, each one using a different group for testing. 
We repeated this procedure ten times, and the overall averages are the results 
presented here. Table 1 shows averaged mean absolute errors and standard 
deviations for each aberration coefficient. As the experimental results show, our 
method is very accurate with the simulated interferograms. On average, each 
interferogram took 1.6 seconds to process, which is much faster than the results 
reported in recent works dealing with the same problem. For example, [11] 
reports that evolution strategies took about 3 minutes to find the parameters of 
each interferogram, using the same resolution and similar computing hardware. 
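
As a rough check of this comparison, taking the approximately 3 minutes per interferogram reported in [11] and the 1.6 seconds measured here, the speedup factor is

```latex
\frac{3 \times 60\ \mathrm{s}}{1.6\ \mathrm{s}} = 112.5
```

that is, about two orders of magnitude for the noiseless case.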

Real data always pose the challenge of managing noise. In order to evaluate 
the noise sensitivity of our method, we performed experiments on interferograms 
with simulated noise, using both Gaussian noise and an intensity gradient, as 
described in Section 3. Table 2 shows errors in aberration coefficients and intensity 
gradients. In Figure 2 we can see a visual comparison between noisy 
interferograms and the interferograms obtained from the predicted aberrations. It can be 
a b c d e 

Fig. 2. The top row shows five of the interferograms used for testing and the bottom 
row shows the best matches found by the algorithm. Columns a and b show noiseless 
interferograms; it can be seen that the matches found by the algorithm are virtually 
indistinguishable from the test interferograms. Columns c, d and e show noisy 
interferograms; the matches found by the algorithm show almost identical interferograms, 
except that the noise has been removed. 

a b c d e 

Fig. 3. A detailed trace of the optimization algorithm using a noiseless interferogram as 
input. The test interferogram is shown in a; b shows the result of applying LWR using 
only the original (randomly generated) training data. Figures c and d show successive 
approximations given by the algorithm as it iterates toward a solution; finally, the best 
result found is shown in e. 



seen that our method’s performance was not damaged by the noise in the test 
data. For this case, because the parameter space is larger, the 
running time also increases, taking 9 seconds on average to find the optimal 
parameters of each interferogram. As this is, to the best of our knowledge, the 
first attempt to approximate interferograms using a noise model that is more 
complex than simple Gaussian noise, we cannot compare our results with 
previous approaches. However, the running time is still much shorter than that taken 
by the method that only deals with noiseless interferograms. 

In Figure 3 we present a detailed execution trace of the algorithm, using 
a randomly chosen noiseless test example. We can see how the algorithm 
gradually converges to a set of parameters that generate an interferogram that is 






O. Fuentes and T. Solorio 




a b c d e 

Fig. 4. A detailed trace of the optimization algorithm using a noisy interferogram as 
input. The test interferogram is shown in a; b shows the result of applying LWR using 
only the original (randomly generated) training data. Figures c and d show successive 
approximations given by the algorithm as it iterates toward a solution; finally, the best 
result found is shown in e. 



virtually indistinguishable from the test interferogram. Figure 4 shows a similar 
trace, except that the input is now a noisy interferogram. The output of the 
algorithm is again an almost exact match to the test interferogram, except that 
the noise has been eliminated. 

6 Conclusions 

In this paper we have presented an optimization algorithm that has a very strong 
feature: the ability to extend the training set automatically in order to best fit 
the target function for the test data. There is no need for manual intervention, 
and if new test instances need to be processed the algorithm will generate as 
many training examples as needed. 

We have shown experimental results of the application of our method to solve 
the problem of, given a large set of interferograms, finding their corresponding 
vectors of aberration coefficients. The method yields very accurate results, even 
in the presence of noise, and also, it is faster by two orders of magnitude than 
other methods introduced earlier. 

Present and future work includes: 

— Testing the method using real interferograms. 

— Extending the algorithm to handle higher-order aberrations. 

— Testing the applicability of the method to other optimization problems in 
optics, as well as in other areas of science. 

Abstract. In this paper we make use of a modified Grid Based Fuzzy 
System architecture, which may provide an exponential reduction in the 
number of rules needed. We also introduce an algorithm that automat- 
ically, from a set of given I/O training points, is able to determine the 
pseudo-optimal architecture proposed as well as the optimal parameters 
needed (number and position of membership functions and fuzzy rule 
consequents). The suitability of the algorithm and the improvement in 
both performance and efficiency obtained are shown in an example. 

1 Introduction 

The estimation of an unknown model from a set of input/output data is a crucial 
problem in a number of scientific and engineering areas, and a great deal of 
research effort has been devoted to it. The objective is to obtain a model that 
gives the expected output for any new input data. Regression or function 
approximation problems deal with continuous input/output data, while classification 
problems deal with discrete, categorical output data. In this paper we are 
concerned with function approximation problems, in which we want to obtain the 
model that best approximates the desired continuous output given any input 
data. 

Several authors have worked with fuzzy systems to deal with the problem of 
function approximation. One of the first studies in this context was carried out 
by Wang and Mendel [1] presenting a general method for combining numerical 
and linguistic information into a fuzzy rule-table. A procedure was proposed in 
which each datum generates a rule, though this approach produces an enormous 
number of rules when the input data set is considerable. Other approaches have 
also attempted to solve function approximation problems by means of cluster- 
ing techniques [2,9]. In general, two main approaches might be taken for the 
partitioning of the input space: 

On the one hand, the use of fuzzy clusters (see Fig. 1a) performs a marginal 
subdivision of the input space depending obviously on the number of rules taken 
to reach the objective. This approach has the disadvantage that the whole input 

Fig. 1. a) Clustering techniques for function approximation, b) Grid techniques for 
function approximation 



space might not be covered properly. Some input space regions might be kept 
uncovered by any rule. Besides, the use of clustering for function approximation 
problems generally does not take into account the interpolation properties of the 
approximator system [3]. 

On the other hand, Grid-Based Fuzzy Systems (see Fig. 1b) provide a thorough 
coverage of the whole input space, which makes them especially well suited 
for low-dimension function approximation problems. Several previous works 
have shown the great performance that might be reached using this kind 
of partitioning of the input space. Nevertheless, in this last approach the number 
of rules used by the fuzzy system increases exponentially with the number 
of input variables and with the number of membership functions per variable. 
This increase results in a loss of effectiveness and in the loss of one of the main 
properties of fuzzy systems, the understanding and interpretability of the 
system. 

In this paper we use a very simple structure to overcome the problem of 
the curse of dimensionality for Grid-Based Fuzzy Systems. Apart from presenting 
this simple and convenient sort of fuzzy system, we also provide an 
algorithm that, when possible, selects the group of variables that will form 
each sub-grid, resulting in a MultiGrid structure. Once we know the hard 
structure of our multigrid system, we provide an adaptive algorithm to 
select the optimal parameters and fine structure of the system, to obtain the 
final optimal function approximator for the given data set. 



2 MultiGrid-Based Fuzzy System (MGFS) Architecture 

When we have a high number of input variables, an N-dimensional grid might 
seem useless for our aim of approximating the input points, since 
having too many rules, as well as too many antecedents in each rule, results in 
an incomprehensibly large model. Also, managing so many parameters may become 
an efficiency bottleneck, making the model impossible to optimize. 

Now considering a high dimensional space, we propose to use systems in the 
form [4]: 




Fig. 2. MultiGrid-Based Fuzzy System (MGFS) 



Each group of variables is used to define a Grid-Based Fuzzy System 
(GBFS), from which a set of rules is obtained in the form [5]: 

IF $x_1^p$ is $X_1^{i,p}$ AND ... AND $x_{N_p}^p$ is $X_{N_p}^{i,p}$ THEN $y = R_i^p$ (1) 

this being the $i$-th rule of the $p$-th GBFS. Thus, all the rules from all the GBFSs 
form the whole MGFS, whose output is obtained by normalizing according to 
the number of GBFSs. Therefore the final output of the system for any input 
value $\vec{x} = (x_1, x_2, \ldots, x_N)$ can be expressed as follows: 

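$$F(\vec{x}, MF, R, C) = \frac{\sum_{p=1}^{P} \sum_{j=1}^{R_p} R_j^p \prod_{m=1}^{N_p} \mu_j^{m,p}(x_m^p)}{\sum_{p=1}^{P} \sum_{j=1}^{R_p} \prod_{m=1}^{N_p} \mu_j^{m,p}(x_m^p)}$$ (2) 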



Fig. 3. Different MGFS topologies: a) one simple topology considers one GBFS 
per variable, therefore having simple rules with only one antecedent, one for each 
membership function of each variable; b) a high number of more complex topologies 
are possible; here we present a two-GBFS topology in which the first GBFS has variables 
$x_1, x_2$ while the second GBFS has variables $x_2, x_3, x_4$. Note how a single variable 
might appear in several GBFSs with different membership function distributions; c) the 
most expensive topology has a single GBFS for all the variables. The number of rules 
here might be too high in terms of interpretability and efficiency. 



where the dependency of the output function on the structure of membership 
functions of the system MF, on the consequents of the whole set of rules R, and 
on the hard structure of the system 
$C = \{\{x_1^1, \ldots, x_{N_1}^1\}, \ldots, \{x_1^P, \ldots, x_{N_P}^P\}\}$, i.e., the input 
variables entering each individual GBFS, is stated explicitly. 
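
The normalized output of an MGFS can be illustrated with a short Python sketch. It is only a sketch under stated assumptions: the function names, the data layout (`grids` as a list of tuples) and the use of triangular membership functions forming a strong partition are choices made here for illustration, not details fixed by the text at this point.

```python
from itertools import product


def triangular_memberships(x, centers):
    """Degrees of x in each triangular MF of a partition over sorted centers."""
    degrees = [0.0] * len(centers)
    if x <= centers[0]:
        degrees[0] = 1.0
    elif x >= centers[-1]:
        degrees[-1] = 1.0
    else:
        for j in range(len(centers) - 1):
            left, right = centers[j], centers[j + 1]
            if left <= x <= right:
                degrees[j + 1] = (x - left) / (right - left)
                degrees[j] = 1.0 - degrees[j + 1]
                break
    return degrees


def mgfs_output(x, grids):
    """Normalized weighted sum over all rules of all GBFSs.

    x     : sequence of input values
    grids : list of (variables, centers, consequents), one entry per GBFS
            variables   -- tuple of indices into x (hypothetical layout)
            centers     -- one sorted list of MF centers per variable
            consequents -- dict mapping a tuple of MF indices to a rule consequent
    """
    numerator = 0.0
    denominator = 0.0
    for variables, centers, consequents in grids:
        degrees = [triangular_memberships(x[v], c)
                   for v, c in zip(variables, centers)]
        # One rule per combination of membership functions in this sub-grid.
        for rule in product(*(range(len(c)) for c in centers)):
            activation = 1.0
            for m, j in enumerate(rule):
                activation *= degrees[m][j]
            numerator += consequents[rule] * activation
            denominator += activation
    return numerator / denominator


# Two one-variable GBFSs, each with two membership functions on [0, 1].
example_grids = [
    ((0,), [[0.0, 1.0]], {(0,): 0.0, (1,): 10.0}),
    ((1,), [[0.0, 1.0]], {(0,): 0.0, (1,): 2.0}),
]
example_output = mgfs_output((0.5, 0.25), example_grids)
```

With a partition of this kind the activations of each GBFS sum to one, so the denominator equals the number of GBFSs, which matches the normalization described above.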

Several architecture forms are therefore possible for any given problem with 
a set of input variables (see Fig. 3). The simplest case is when each variable 
forms a set by itself (some variables may even be absent if they have no 
influence on the output of the system); then each rule in each set of variables 
has a single antecedent. 

Many more complex configurations are possible, covering all the combinations 
of the input variables, up to the case of keeping a single set with the 
whole number of input variables, that is, a (single) grid-based 
fuzzy system (GBFS). 

Now that we have an architecture that, when possible, might reduce the 
number of rules exponentially, we will study how we can identify the underlying 
structure of the data model in order to group the variables and form the optimal 
MultiGrid-Based Fuzzy System (MGFS). 

3 Hard-Structure Identification 

In this section we present a very effective algorithm to determine the GBFSs 
that will comprise the system hard structure, the final MGFS, as shown in Fig. 2. 
Note how difficult it is to guess the GBFSs that could form the structure of 
the system. For 4 variables, for example, 4 GBFSs of one variable + 6 GBFSs 
of two variables + 4 GBFSs of three variables (+ 1 GBFS of four variables) are 
candidate elements to form the structure. We would then have to choose, from 
every possible grouping of these 15 GBFSs, the one that performs best with the 
fewest rules to form the final MGFS, which gives thousands of possible 
combinations even for this simple problem. 
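
The count of candidate GBFSs can be checked with a few lines of Python. This is only an illustrative sketch; the variable names are chosen here, and the last figure counts every non-empty selection of candidates, including redundant ones in which one group is contained in another, so it is an upper bound on the groupings worth testing.

```python
from itertools import combinations

variables = (1, 2, 3, 4)

# Every non-empty subset of the input variables is a candidate GBFS.
candidates = [frozenset(c)
              for size in range(1, len(variables) + 1)
              for c in combinations(variables, size)]

# Number of candidates of each size: 4, 6, 4 and 1 for four variables.
per_size = {size: sum(1 for g in candidates if len(g) == size)
            for size in range(1, len(variables) + 1)}

# Upper bound on the number of groupings: any non-empty set of candidates.
groupings = 2 ** len(candidates) - 1
```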

To tackle this problem, a Top-Down algorithm is now presented. It starts 
from a complete and effective grid fuzzy system and decreases 
its complexity step by step while possible. At each step it builds a 
simpler MGFS, keeping the optimal number of membership functions 
per variable and recalculating the consequents of the new rules. On each step of the 
algorithm, if the error obtained is "similar" to that of the previous, more complete 
configuration, we take the new, simpler configuration as the chosen one. By similar 
we mean that the error does not increase beyond a tolerance level. If the error 
obtained is higher, another alternative is chosen. This continues until 
no simpler MGFS can be obtained while keeping the error level. The detailed 
algorithm is presented now: 



Top-Down Algorithm 

1:  Initialize the fuzzy system with a complete grid, setting the optimal number of 
    membership functions per input variable. 
2:  while further steps can be performed do 
3:    NumberOfGroups = the number of variable groups at this moment 
4:    for I = 1:NumberOfGroups do 
5:      Decompose group I into all the possibilities having one variable 
        less and add them temporarily to the set of variable groups, taking away 
        group I. 
6:      Temporarily take away any group included in a bigger one. 
7:      Evaluate the system configuration and take the overall error. 
8:      if the new error < previous error + tolerance then 
9:        Make the previous decomposition definitive. 
10:       for J = 1:NumberOfNewAddedSubGroups do 
11:         Temporarily take away subgroup J. 
12:         Temporarily take away any group included in a bigger one. 
13:         Evaluate the system configuration and take the overall error. 
14:         if the new error < previous error + tolerance then 
15:           Make the elimination of subgroup J definitive. 
16:         else 
17:           Undo the previous elimination of subgroup J. 
18:         end if 
19:       end for 
20:     else 
21:       Undo the previous decomposition. 
22:     end if 
23:   end for 
24: end while 
25: Return the final optimal configuration for the MultiGrid-Based Fuzzy System. 
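
The control flow of the Top-Down algorithm can be sketched in Python as follows. This is a simplified illustration, not the authors' implementation: `evaluate` stands for steps 7 and 13 and is passed in as a callable, the reference error is updated after every accepted change, groups of a single variable are not decomposed further, and `toy_error` is a made-up error function that only mimics a target depending on the pair (1, 2) and on variable 4.

```python
from itertools import combinations


def prune_subsumed(groups):
    """Steps 6 and 12: drop every group strictly contained in a bigger one."""
    return {g for g in groups if not any(g < other for other in groups)}


def top_down(variables, evaluate, tolerance):
    groups = {frozenset(variables)}                 # step 1: one complete grid
    reference = evaluate(groups)
    progress = True
    while progress:                                 # step 2
        progress = False
        for group in sorted(groups, key=sorted):    # step 4
            if group not in groups or len(group) < 2:
                continue
            # Step 5: replace the group by its subsets with one variable less.
            subgroups = [frozenset(c)
                         for c in combinations(sorted(group), len(group) - 1)]
            candidate = prune_subsumed((groups - {group}) | set(subgroups))
            error = evaluate(candidate)             # step 7
            if not error < reference + tolerance:   # steps 20-21: undo
                continue
            groups, reference, progress = candidate, error, True   # step 9
            for sub in subgroups:                   # step 10
                if sub not in groups:
                    continue
                trial = prune_subsumed(groups - {sub})   # steps 11-12
                if not trial:
                    continue
                trial_error = evaluate(trial)       # step 13
                if trial_error < reference + tolerance:
                    groups, reference = trial, trial_error   # step 15
    return groups, reference                        # step 25


def toy_error(groups):
    """Hypothetical error: low only if {1, 2} and variable 4 are both covered."""
    interaction = any({1, 2} <= g for g in groups)
    linear = any(4 in g for g in groups)
    return 0.025 if interaction and linear else 0.4


final_groups, final_error = top_down([1, 2, 3, 4], toy_error, tolerance=0.01)
```

In a real run `evaluate` would build the sub-grids, optimize the rule consequents and return the training error, as described in the text that follows the algorithm.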

In steps 7 and 13 we EVALUATE by optimizing the consequents and evaluating 
the error using the data set of input/output points. The system configuration is 
obtained by taking the number of membership functions per input variable, setting 
the membership functions equally distributed over each input variable's domain, 
and forming the rules for each sub-grid. Considering the data set D, we apply 
a Least Square Error (LSE) algorithm to optimize the rule consequents [6]. 

The well-known expression for the square error, given the data set D, the 
distribution of membership functions MF, the rule consequents R, and the MGFS 
configuration C, is: 

$$J(D, MF, R, C) = \sum_{\vec{x}_k \in D} \left( y_k - F(\vec{x}_k, MF, R, C) \right)^2$$ (3) 



Differentiating J with respect to each rule consequent gives a linear system 
of equations with R unknowns and R equations. This procedure to calculate the rule 
consequents is independent of the form and distribution of the membership functions. 
Singular Value Decomposition (SVD) is the method used to solve 
the system of equations [7]. Because of the high redundancy that might appear in the 
matrix of the system, this method is well suited to our problem. 

Once the rule consequents have been optimally calculated, the error of 
the MGFS will be measured using the Normalized Root-Mean-Square Error 
(NRMSE) [6]. 
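
As a rough illustration of the two error measures, the following Python sketch computes the square error and an NRMSE. The NRMSE definition is an assumption made here (root-mean-square error divided by the standard deviation of the target outputs); the exact definition used in [6] is not restated in this paper, and the function names are hypothetical.

```python
from math import sqrt


def squared_error(targets, outputs):
    """Sum of squared differences between desired and obtained outputs."""
    return sum((y - f) ** 2 for y, f in zip(targets, outputs))


def nrmse(targets, outputs):
    """RMSE normalized by the standard deviation of the targets (assumed)."""
    n = len(targets)
    mean = sum(targets) / n
    variance = sum((y - mean) ** 2 for y in targets) / n
    return sqrt(squared_error(targets, outputs) / n / variance)


targets = [0.0, 1.0, 2.0, 3.0]
perfect = nrmse(targets, targets)          # a perfect model gives 0
constant = nrmse(targets, [1.5] * 4)       # predicting the mean gives 1
```

Under this definition a value close to 1 means the model is no better than predicting the mean output, which makes the 0.025 and 0.4 levels reported later easy to interpret.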

After applying the whole algorithm we will have the pseudo-optimal structure 
of groups of variables. It now remains to perform a final parameter tuning to 
have the system completely fitted to the dataset D. 

4 Parameter Tuning 

Now that we have the final MultiGrid structure, let's perform the parameter 
adjustment so that the error is minimized for a given membership 
function configuration. 

For this, we will use a triangular partition configuration [5,9] and make use 
of the method presented in [6]. We have already explained how to calculate the 
consequents of the rules for a given MF configuration (number and position of 
the membership functions). Now let's study how we can set the position of the 
centers of the MFs. 

A two-step algorithm is performed to optimize the position of the centers 
of the MFs. First, an initialization sets the centers to pseudo-optimal 
values through a heuristic that we explain below. Second, a gradient-based 
method is applied to obtain the local minimum for the given initial 
configuration. 

The first step is an iterative process with two phases: calculating a 
slope parameter for each center, and adjusting the centers. The objective of this step 
is to have the same amount of error at each side of each membership function 
according to the dataset D. In each iteration, for each center we calculate 
the value: 



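$$p_j = \left( \sum_{k \in D,\; x_k \text{ left of } c_j} e^2(x_k) \right) - \left( \sum_{k \in D,\; x_k \text{ right of } c_j} e^2(x_k) \right)$$ (4) 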



A positive value of the parameter means that the contribution of the left 
side of the MF to the error is higher than the right side one; therefore we would 
have to move the center of the MF to the left, and vice versa. 
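
The balance of error around a center can be sketched in Python as below. This is only an illustration under an assumption made here: the slope is taken to be the squared error accumulated between the previous center and this one, minus the squared error accumulated between this center and the next one. The function and argument names are hypothetical.

```python
def error_imbalance(center_left, center, center_right, inputs, errors):
    """Squared error to the left of a center minus squared error to its right.

    inputs -- values of one input variable for the training points
    errors -- approximation error of the current system at those points
    """
    left = sum(e * e for x, e in zip(inputs, errors)
               if center_left <= x < center)
    right = sum(e * e for x, e in zip(inputs, errors)
                if center < x <= center_right)
    return left - right


# Most of the error lies to the left of the center at 0.5, so the value is
# positive and the center would be moved to the left.
slope = error_imbalance(0.0, 0.5, 1.0,
                        inputs=[0.1, 0.2, 0.6],
                        errors=[1.0, 1.0, 0.5])
```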

Afterwards we perform the following movement of the centers: 



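$$\Delta c = \begin{cases} \ldots & \text{if the slope parameter is } > 0 \\ \ldots & \text{if the slope parameter is } < 0 \end{cases}$$ (5) 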



Here b is the active radius, which is the maximum variation distance and is 
used to guarantee that the order of the membership function locations remains 
unchanged (a typical value is b = 2); the temperature indicates how 
far the center will be moved. This step of the algorithm works iteratively, 
moving the centers until the error is balanced on each side of each 
center. 

The last step is to find a local minimum from this initial configuration. For 
this purpose, any of the gradient-based methods found in the literature 
can be chosen (steepest descent, conjugate gradient, the Levenberg-Marquardt 
algorithm, etc.). 

We have now described a tool that, for a given MultiGrid configuration and 
for a given membership function configuration, allows us to find pseudo-optimal 
parameter values for the whole MGFS. But, as in [8], here we will go one step 
further and try to optimize the number of membership functions associated with 
each input variable of each GBFS that forms the MGFS. This is the task 
accomplished in the next section. 

5 Fine-Structure Identification 

We have explained two important phases of our approach to function 
approximation. The MultiGrid structure algorithm gives us the key 
to reducing, in many cases, the complexity of our system exponentially. A parameter 
adjustment algorithm has also been obtained for a given MultiGrid structure and 
a fixed number of membership functions per input variable. Now let's explain 
how we can adjust the number of membership functions per input variable 
according to a final error objective. Systems that are too complex might be useless 
even though they give much less error, and systems that are too simple might not 
perform well enough. The algorithm we explain here gives us the last key to obtain 
the system that best fits the NRMSE we want with the least complexity possible. 

This part of the whole algorithm works together with parameter tuning 
to try to obtain the simplest yet most effective system according to a limit on 
the error that we impose. The idea is to begin from a topology where all the 
variables in all the sub-grids start with, for example, two membership functions 
per input variable. The parameter identification sub-algorithm is performed to 
check whether the system at this point meets the error goal. 

Step by step, we check in which sub-grid and for which variable adding a new 
membership function decreases the error most. There we add a new membership 
function, and execute the parameter identification sub-algorithm again 
to check whether the error goal has been reached. 
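
The greedy procedure just described can be sketched in Python as follows. It is a minimal illustration under assumptions made here: `tune_and_evaluate` stands for the parameter identification sub-algorithm and is passed in as a callable, each (sub-grid, variable) pair is called a slot, and the `max_mfs` cap is added only to guarantee termination. The toy error function is invented for the example.

```python
def grow_membership_functions(slots, tune_and_evaluate, error_goal, max_mfs=10):
    """Add one membership function at a time where it reduces the error most."""
    counts = {slot: 2 for slot in slots}        # start with two MFs everywhere
    error = tune_and_evaluate(counts)
    while error > error_goal:
        best_slot, best_error = None, error
        for slot in slots:
            if counts[slot] >= max_mfs:
                continue
            trial = dict(counts)
            trial[slot] += 1
            trial_error = tune_and_evaluate(trial)
            if trial_error < best_error:
                best_slot, best_error = slot, trial_error
        if best_slot is None:                   # no addition helps any more
            break
        counts[best_slot] += 1
        error = best_error
    return counts, error


# Hypothetical slots: sub-grid g1 holds x1 and x2, sub-grid g2 holds x4.
slots = [("g1", "x1"), ("g1", "x2"), ("g2", "x4")]


def toy_tuning_error(counts):
    """Made-up error that shrinks with more MFs on x1 and x2 only."""
    return 0.12 / counts[("g1", "x1")] + 0.12 / counts[("g1", "x2")]


final_counts, reached_error = grow_membership_functions(
    slots, toy_tuning_error, error_goal=0.05)
```

In this toy run the variable with a purely linear effect keeps its two initial membership functions, which is the behaviour reported for variable 4 in the simulations.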

6 Simulations 

Now that we have all the tools for the whole method for function approximation, 
let's execute the whole algorithm for a representative example. 

We will demonstrate how the proposed algorithm works with the following 
example taken from the literature [4,8]: 

$$F(x_1, x_2, x_3, x_4) = 10 \sin(\pi x_1 x_2) + 0 x_3 + 5 x_4 + \varepsilon, \quad \text{with } x_1, x_2, x_3, x_4 \in [0, 1]$$ (6) 

We have 10,000 training points generated using this function, and we 
introduce a random error $\varepsilon$ with variance 0.1. We will first evaluate how the 
structure would be detected. The initial configuration will be a whole grid fuzzy 
system having 5 membership functions for each variable. Table 1 shows 
the steps that the algorithm would follow. 
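
A training set of this kind could be generated as in the following Python sketch. The text only states the number of points and the variance of the random error, so the uniform sampling of the inputs, the zero-mean Gaussian noise, the seed and the function names are assumptions made here for illustration.

```python
import math
import random


def target(x1, x2, x3, x4):
    """Noise-free part of the example function; x3 has no effect."""
    return 10.0 * math.sin(math.pi * x1 * x2) + 0.0 * x3 + 5.0 * x4


def make_training_set(n_points=10_000, noise_variance=0.1, seed=0):
    """Return a list of ((x1, x2, x3, x4), y) pairs with noisy outputs."""
    rng = random.Random(seed)
    sigma = math.sqrt(noise_variance)   # standard deviation of the noise
    data = []
    for _ in range(n_points):
        x = tuple(rng.random() for _ in range(4))
        data.append((x, target(*x) + rng.gauss(0.0, sigma)))
    return data


training_set = make_training_set()
```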

From this execution we see how the algorithm discards most of 
the possibilities, taking on each step the only possible configuration according 
to the stability of the training error. Notice that, from the thousands of 
possibilities to form an MGFS for 4 input variables, fewer than 15 possibilities 
need to be tested following the algorithm steps to get the optimal MGFS 
configuration. A final configuration having one grid of two variables ($x_1$ and $x_2$) 
and one grid of one variable ($x_4$) is taken for parameter adjustment. 

Table 1. Trace and results for the Top-Down algorithm for the example 

Step of the algorithm   Variable Groups                            NRMSE 
1                       {1,2,3,4}                                  0.025 
2, 4, 5                 {2,3,4}, {1,3,4}, {1,2,4}, {1,2,3}         0.025 
9, 11                   {1,3,4}, {1,2,4}, {1,2,3}                  0.025 
15, 11                  {1,2,4}, {1,2,3}                           0.025 
15, 11                  {1,2,3}                                    0.406 
17, 11                  {1,2,4}                                    0.025 
15, 2, 4, 5             {2,4}, {1,4}, {1,2}                        0.025 
9, 11                   {1,4}, {1,2}                               0.025 
15, 11                  {1,2}                                      0.406 
17, 11                  {1,4}                                      0.715 
17, 2, 4, 5             {1,2}, {4}                                 0.025 
9, 11                   {1,2}                                      0.406 
17, 4, 5                {4}, {1}, {2}                              0.435 
21, 2, 25               {1,2}, {4}                                 0.025 



Notice that this algorithm not only performs the selection of groups but also 
accomplishes a task of variable selection. If any of the variables 
does not affect the output of the system, it will be immediately detected and 
discarded, further decreasing the complexity of our system for fine parameter 
tuning and improving the final interpretability and usability of the resulting MGFS. 

Next let's check the results for the parameter and fine-structure tuning. Setting 
a limit of 0.01 for the NRMSE, after applying the whole algorithm, the final 
number of membership functions needed per input variable is 6 for 
variables 1 and 2 (0 for variable 3, which was already discarded by the MGFS 
selection algorithm) and 2 for variable 4. The algorithm works by adding 
membership functions to the first two input variables without adding any 
to variable 4 since, as noted in reference [8], the linear dependence on this 
variable is easily identified with two membership functions. 



7 Conclusions 

In this paper we have presented the utility of a MultiGrid-Based Fuzzy System 
(MGFS) architecture to reduce the complexity of a fuzzy system model when the 
number of input variables grows. In addition, an algorithm has been presented that is 
capable of finding a suitable MGFS topology together with the pseudo-optimal 
parameters defining it, in order to model the underlying system from 
a set of given I/O data points. The parameters of the MGFS model comprise 
not only the position of the membership functions of every GBFS but also the 
optimal number of them for a given target accuracy. Finally, the functioning 
of the method has been demonstrated through a simple but rather instructive 
example. 

Abstract. This paper elaborates on two techniques, deconstruction and 
composition, to handle complex data in order to learn from it. We pro- 
pose typed higher-order logic as a suitable representation formalism for 
domains with complex structured data. Both techniques derive naturally 
from such framework. A naive sequential covering algorithm which uses 
both techniques is applied on well known learning datasets (simple and 
structured) to test them with good results. A further experiment on the 
change of knowledge representation is presented to showcase the robust- 
ness of our approach. 



1 Introduction 

Inductive learning focuses on techniques for supervised learning from examples. 
Traditionally, inductive learners have used the attribute-value language to represent 
training examples and induced hypotheses. The relative simplicity of this 
attribute-value language representation allows the implementation of efficient 
learning systems. However, it also makes it difficult to apply such systems to 
domains with complex structure such as molecular biology, where data is rich in 
structure and the structural information may provide clues essential in inducing 
insightful concepts. Although not designed primarily to overcome such problems, 
the learning paradigm of Inductive Logic Programming (ILP) [11] allows 
such domains to be tackled. Since ILP is based on first-order logic, some work 
must still be done to represent highly structured data in order to learn from it. 

It can be argued that, from a knowledge representation point of view, it would 
be more desirable to be able to capture physical structures in the data with cor- 
responding abstract structures in its representation. For instance, to represent a 
molecule, which is a collection of connected atoms, as a graph; or a collection of 
figures, as a set. Then, if we can design learning algorithms capable of manipulat- 
ing such representations directly, it would be possible to apply machine learning 
directly and naturally to domains rich in structure. Clearly, the additional ex- 
pressiveness of the representation may lead to an increase of the search space, 
which must be constrained in some way. Furthermore, the emphasis should be 
on gaining insight from the data, because the quality of the induced knowledge, 
as measured by both accuracy and comprehensibility, is more important than 
processing time. 

We use these arguments as our motivation for using a typed higher-order logic 
for knowledge representation, and for upgrading existing learning model classes 
(e.g., decision tree induction, rule induction, etc.) to this richer representation. 
Here we elaborate on two techniques presented separately before [9,8] and on how, 
combined, they can be used to induce rules. Further details on the representation are 
in [3,4,2,12]. A decision tree learning algorithm based on it is described in [4,2] 
and a genetic programming system that uses this basis is described in [6]. 

The paper is organised as follows. Section 2 discusses knowledge represen- 
tation using typed higher-order closed terms. Section 3 briefly discusses decon- 
struction as a means to obtain characteristic features of examples. Section 4 
discusses in more detail predicate construction through function composition 
and briefly presents a naive sequential covering algorithm based on composi- 
tion and deconstruction. Section 5 presents experimental results and section 6 
presents an experiment on the importance of knowledge representation on the 
learning process and how our approach handles it. Finally, in section 7 some 
concluding remarks are presented. 



2 Typed Higher-Order Knowledge Representation 

In order to capture complex structures we need correspondingly complex abstract 
data structures or types. Our representation formalism, whose details are in [4], 
is expressive enough to allow the representation of arbitrary structures as closed 
terms of the corresponding type. Thus, all information is typed. 

We note here that, in this sense, our formalism is a natural extension of 
the attribute-value framework. Indeed, in the attribute-value language, each at- 
tribute has a type (e.g., nominal, real) and examples are tuples (another simple 
type) of constants drawn from the domains of the corresponding types. The ef- 
ficiency of learners using this simple representation is in large measure a direct 
result of its strong typing, as types act as constraints on the search space. 
Interestingly, ILP has focused on first-order representations implemented in Prolog, 
which has raised the need to create an ad hoc typing system. We believe that 
this approach, which has proved very valuable [13], still falls short of handling 
complex data in a straightforward manner. 

In contrast, we wish to make types an intrinsic part of first and higher-order 
representations. This enables the examination of possible relations among dis- 
tinct elements that have the same type whilst limiting the associated increase 
of the search space. Of course, it is possible to simulate lack of types by using 
the same type for all elements of a type (e.g., a tuple). We use the programming 
language Escher [7] as the vehicle for typed higher-order logic knowledge repre- 
sentation. Escher is an integrated functional and logic programming language, 
based on Church’s theory of types. The syntax of Escher coincides with that of 
Haskell for the functional subset and includes extensions for quantifiers and set 
constructs. A formal account of Escher is beyond the scope of this paper. For 
our purposes here, it is sufficient to state that Escher implements the necessary 
computational machinery to handle standard abstract data types such as tuples, 
lists, sets, multisets, trees and graphs. In fact, one can in principle construct any 
arbitrary abstract data type with its associated structure and operations. 

As mentioned above, examples are represented as closed terms of the abstract 
data type that “best” captures the structure of the domain. We give a few 
examples to illustrate the approach trying to use standard mathematical (or 
familiar) notation here to improve readability. 

Example 1 Consider the Mushroom Database available at UCI¹. Each example 
describes a mushroom belonging to the Agaricus and Lepiota Family. We can 
represent examples as tuples using: 

type Mushroom = (CapShape, CapSurface, CapColor, Odor, Bruises) 

Note we are using a sample of all the attributes available. Base types can 
then be declared as follows.² 

data CapShape = Bell | Conical | Convex | Flat | Knobbed | Sunken 
data CapSurface = Fibrous | Grooves | Scaly | Smooth 
data CapColor = Brown | Buff | Cinnamon | Gray | Green | Red 
data Odor = Almond | Anise | Creosote | Fishy | Foul | None 
type Bruises = Boolean 

The following is an example of a mushroom. 
mushroom = (Convex, Fibrous, Red, None, False) 



Example 2 Consider Michalski's East-West Challenge [10] involving trains 
made up of load-carrying cars. Each car has several attributes: shape, length, 
number of wheels, roof and kind. Each car carries a number of cargo objects as 
well. We can model such trains as a list of cars, i.e., 

type Train = [Car] 

Each car can in turn be represented by a tuple consisting of its attributes and 
its load, i.e., 

type Car = (Shape, Length, Wheels, Kind, Roof, Load) 

and each load as a tuple consisting of an object and number of elements, i.e., 
type Load = (Object, Number) 

The following is the first train going east. 

ftrain = [(Rectangular, Long, 2, Open, None, (Square, 3)), 
          (Rectangular, Short, 2, Closed, Peaked, (Triangle, 1)), 
          (Rectangular, Long, 3, Open, None, (Hexagon, 1)), 
          (Rectangular, Short, 2, Open, None, (Circle, 1))] 

Note that this representation has the advantage that all the information 
relevant to an example is localised but the potential disadvantage that some 
information may be repeated. 

¹ http://www.ics.uci.edu/~mlearn/MLRepository.html 

² Note that the keyword data indicates the declaration of a type and the data 
constructors of that type, whilst the keyword type indicates a type synonym. 

3 Deconstruction 

Now that we have a language to represent structured data as abstract data types, 
we need mechanisms to manipulate this representation and use it in learning. 
We first consider the extraction of components or characteristic features from 
examples. 

In order to construct useful hypotheses from examples, we must be able to 
access their constituent parts in order to discover features relevant to a given 
classification task. The technique we use to obtain such components is called 
deconstruction³. We briefly discuss it here; further details can be found 
in [9]. 

The objective of deconstruction is to take a single term and return the set 
of its constituent elements. Each abstract data type has, associated with it, a 
set of accessor functions, which essentially perform the inverse operation of its 
constructor. The user can also provide any other type by declaring its accessor 
functions. Given a term, deconstruction is applied recursively to it until a base 
type is encountered. The deconstruction process creates a set of tuples of the form 

(term, type, value, predicate) 

where the predicate asserts the membership of a sub-term to the term. To 
illustrate how deconstruction works, we show the deconstruction of the two 
examples of section 2. Note that there are no predicates when we are only using 
the accessor functions. 

Example 3 Some elements of the deconstruction of the sample mushroom term 
in Example 1 are: 

(v1, Mushroom, (Convex, Fibrous, Red, None, False), true) 
(projCapShape(v1), CapShape, Convex, true) 
(projCapSurface(v1), CapSurface, Fibrous, true) 

Example 4 Some elements of the deconstruction of the sample ftrain term 
in Example 2 are: 

(head(v1), Car, (Rectangular, Long, 2, Open, None, (Square, 3)), true) 
(head(v3), Car, (Rectangular, Short, 2, Closed, Peaked, (Triangle, 1)), 
 v3 == tail(v1)) 
(projShape(v4), Shape, Rectangular, v4 == head(v3)) 

Since a term is essentially a tree structure, there is a single deconstructor 
chain for each sub-term, i.e. a conjunction of predicates that link each sub-term 
to the top-level term. 

Example 5 The deconstructor chain of the sub-term Rectangular in Example 
4 is v3 == tail(v1) && v4 == head(v3) 

³ In [9] the term decomposition was used. With hindsight, we realised this was not a 
good choice since it may lead to confusion with the notion of composition defined 
later. 
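
The recursive nature of deconstruction can be illustrated with a small Python sketch. It is a simplification made here, not the Escher implementation: terms are modelled with Python lists and tuples, tuple components are accessed by position (`proj0`, `proj1`, ...) rather than by named accessors, types are omitted, and the chain of predicates is folded into a nested accessor expression instead of introducing fresh variables.

```python
def deconstruct(term, name="v1"):
    """Yield (accessor expression, value) pairs for a term and its sub-terms."""
    yield name, term
    if isinstance(term, list) and term:
        # Lists are taken apart with head and tail.
        yield from deconstruct(term[0], f"head({name})")
        yield from deconstruct(term[1:], f"tail({name})")
    elif isinstance(term, tuple):
        # Tuples are taken apart with one projection per component.
        for i, component in enumerate(term):
            yield from deconstruct(component, f"proj{i}({name})")


# The first two cars of the sample train, written as plain Python values.
train = [
    ("Rectangular", "Long", 2, "Open", "None", ("Square", 3)),
    ("Rectangular", "Short", 2, "Closed", "Peaked", ("Triangle", 1)),
]
elements = dict(deconstruct(train))
```

For instance, `elements["proj0(head(tail(v1)))"]` is the shape of the second car, which corresponds to the chain v3 == tail(v1) && v4 == head(v3) followed by the shape projection.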

4 Composition 

Composition was briefly presented in [8]. We give here further details. Using 
deconstruction, we can show now how our higher-order framework allows complex 
conditions to be built on terms using composition. Composition is the (higher-order) 
function having signature (.) : (β → γ) → (α → β) → (α → γ) and 
defined by (f . g)(x) = f(g(x)), where α, β and γ are types. 

In addition to its accessor functions (see section 3), each abstract data type 
also has a set of observer and modifier functions, which permit the building 
of conditions on terms. The user provides, in general, observer and modifier 
functions to suit a particular application domain. There are, however, some 
general functions that can be supplied for basic types. For instance, the following 
functions for sets are provided: 

— Size. The function card: {T} -> Int that returns the number of elements 
in a set. 

— Filter. The function filter: (T -> Bool) x [T] -> [T] that takes a 
predicate and a set as arguments and returns the set obtained from the 
original one by removing those items that do not satisfy the predicate. 

— Map. The function map : (T -> A) x [T] -> [A] that takes a function and 
a set as arguments and returns the set obtained from applying the function 
to all members of the set. 

Given a set of functions and a bound on how many times to compose them, we 
take each of the functions and try to compose it with every other. Once we have 
performed all possible compositions we decrease the bound. If it reaches zero we 
stop; otherwise we start again with the augmented set and the new bound. 
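The bounded composition loop just described can be sketched in Python as follows. This is an illustrative sketch only: the names `compose` and `compose_up_to` are hypothetical, functions are kept in a dictionary keyed by a printable name, and the type-compatibility check that a typed language such as Escher would impose on each composition is deliberately omitted.

```python
from itertools import product


def compose(f, g):
    """Function composition: compose(f, g)(x) == f(g(x))."""
    return lambda x: f(g(x))


def compose_up_to(seed, bound):
    """Augment a dict of named unary functions with their compositions.

    On each round every function in the current set is composed with every
    other one (and with itself); the bound is then decreased, and the loop
    stops when it reaches zero.
    """
    funcs = dict(seed)
    while bound > 0:
        new = {}
        for (f_name, f), (g_name, g) in product(list(funcs.items()), repeat=2):
            name = f"{f_name}.{g_name}"
            if name not in funcs:
                new[name] = compose(f, g)
        funcs.update(new)  # the augmented set is used in the next round
        bound -= 1
    return funcs


seed = {"inc": lambda x: x + 1, "dbl": lambda x: x * 2}
augmented = compose_up_to(seed, 1)
print(sorted(augmented))
```

Without type checks the set grows quadratically on every round, which is one reason a small depth bound matters in practice.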

Note the utility of our higher-order framework as some of these functions 
take predicates and/or other functions as parameters. By composing accessor, 
observer and modifier functions, it is possible to construct complex conditions 
on terms, as illustrated in the following example. 

Example 6 Consider the trains of Example 2. We can test whether a car 
carries at least one object with the composition 

(>0) . projNumber . projLoad 

Note that this is equivalent, using lambda notation, to 

\x -> (projNumber (projLoad x)) > 0 

We can test whether a train has fewer than 3 cars with more than 2 wheels with 
the composition 

(<3) . length . (filter ((>2) . projWheels)) 

Finally, assume that we are given the functions 

sum: [Int] -> Int 
map: (T -> S) x [T] -> [S] 

where sum adds up the values of the items in a list of integers and map takes a 
function and a list as arguments and returns the list obtained by applying the 
function to each item in the original list. Then, we can test complex conditions, 
such as whether the total cargo of a train is more than 10 objects, with the 
composition 

(>10) . sum . (map (projNumber . projLoad)) 

Note: In [4], accessor, observer and modifier functions are treated uniformly as 
transformations. The distinction highlights the generality of the approach to 
arbitrary abstract data types. 

Finally, note that the building of new functions and predicates through 
composition is akin to the processes of constructive induction and predicate 
invention. 
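For readers more familiar with Python than with Escher notation, the three conditions of Example 6 can be mimicked as below. The car layout (shape, length, wheels, kind, roof, load) is inferred from the tuples of Example 4, and the names `compose`, `proj_wheels`, `proj_load` and `proj_number` are hypothetical stand-ins for the accessor functions.

```python
from functools import reduce


def compose(*fs):
    """Right-to-left composition: compose(f, g, h)(x) == f(g(h(x)))."""
    return reduce(lambda f, g: lambda x: f(g(x)), fs)


# Hypothetical accessors over a car tuple
# (shape, length, wheels, kind, roof, load), with load = (shape, number).
def proj_wheels(car):
    return car[2]


def proj_load(car):
    return car[5]


def proj_number(load):
    return load[1]


# (>0) . projNumber . projLoad
carries_something = compose(lambda n: n > 0, proj_number, proj_load)

# (<3) . length . (filter ((>2) . projWheels))
few_big_cars = compose(
    lambda n: n < 3,
    len,
    lambda cars: [c for c in cars if proj_wheels(c) > 2],
)

# (>10) . sum . (map (projNumber . projLoad))
heavy_train = compose(
    lambda n: n > 10,
    sum,
    lambda cars: [proj_number(proj_load(c)) for c in cars],
)

train = [
    ("Rectangular", "Long", 2, "Open", "None", ("Square", 3)),
    ("Rectangular", "Short", 2, "Closed", "Peaked", ("Triangle", 1)),
]
print(carries_something(train[0]), few_big_cars(train), heavy_train(train))
```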



4.1 ALFIE 

ALFIE, A Learner for Functions In Escher [8,9], is a sequential covering algorithm 
based on the concepts presented so far. It uses examples as beaming guides in the 
search for useful properties and induces concepts in the form of decision lists, i.e., 

if E1 then t1 else if E2 then t2 else ... if En then tn else t0 

where each Ei is a Boolean expression and the ti's are class labels. The class t0 
is called the default and is generally, although not necessarily, the majority class. 

ALFIE first uses composition as a pre-processing step to augment its original 
set of available functions. Given a set of seed functions and a depth bound, ALFIE 
constructs all allowable compositions of functions of up to the given depth bound. 

Then the algorithm uses the deconstruction set (see section 3) of the first 
example and from it finds the Ei with the highest accuracy (measured as the 
information gain on covering) . It then computes the set of examples that are not 
yet covered, selects the first one and repeats this procedure until all examples 
are covered. 
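The covering loop can be sketched as follows. This is a simplified illustration, not ALFIE itself: candidate conditions are supplied directly as named Python predicates instead of being derived from the deconstruction set, and the score used to pick a condition is plain precision on the remaining examples, swapped in for the information-gain measure mentioned above. The names `learn_decision_list` and `classify` are hypothetical.

```python
def learn_decision_list(examples, conditions, default):
    """Sequential covering over (term, label) examples.

    conditions maps a name to a Boolean function on terms. Returns the list
    of rules (name, condition, label) together with the default label.
    """
    rules = []
    remaining = list(examples)
    while remaining:
        seed_term, seed_label = remaining[0]  # first uncovered example
        # Only conditions that hold on the seed example are candidates.
        candidates = [(n, c) for n, c in conditions.items() if c(seed_term)]
        if not candidates:
            break  # leave the rest to the default class

        def precision(item):
            covered = [label for term, label in remaining if item[1](term)]
            return sum(label == seed_label for label in covered) / len(covered)

        name, cond = max(candidates, key=precision)
        rules.append((name, cond, seed_label))
        remaining = [(t, lab) for t, lab in remaining if not cond(t)]
    return rules, default


def classify(rules, default, term):
    """Evaluate the decision list: the first rule that fires decides."""
    for _, cond, label in rules:
        if cond(term):
            return label
    return default


examples = [(1, "odd"), (2, "even"), (3, "odd"), (4, "even")]
conditions = {
    "is_odd": lambda x: x % 2 == 1,
    "is_even": lambda x: x % 2 == 0,
    "positive": lambda x: x > 0,
}
rules, default = learn_decision_list(examples, conditions, "even")
print([name for name, _, _ in rules], classify(rules, default, 5))
```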

Examples are represented as closed terms and the type structure of the top- 
level term automatically makes available the corresponding accessor, observer 
and modifier functions. In addition to these, the user may provide a set of func- 
tions that can be applied to the components of the examples (e.g., see the func- 
tions sum and map in Example 6). Such functions represent properties that may 
be present in the sub-parts of the structure of the data, as well as auxiliary func- 
tions able to transform components. They constitute the background knowledge 
of the learner and may be quite complex. 



5 Experiments 

The main goal of the experiments was to test how deconstruction and compo- 
sition can be used to learn from a variety of problems. Therefore we performed 
experiments on well known datasets both simple and complex. 

The outcome was very satisfactory since results were on par with other learn- 
ing systems. Furthermore, even for well known datasets our approach came up 
with different, interesting answers. 
Table 1. Results for Attribute-value Datasets 

Dataset    CN2 Training  CN2 Test  ALFIE Training  ALFIE Test 
Mushroom   100.00%       100.00%   98.90%          98.20% 
Iris       97.00%        94.00%    97.00%          94.00% 
Zoo        100.00%       82.40%    100.00%         97.00% 


Table 2. Results for Highly-structured Datasets 

Dataset         Progol   ALFIE 
Mutagenesis     83.00%   88.00% 
Mutagenesis SO  67.00%   83.00% 
Mutagenesis RU  81.40%   76.00% 
PTE             72.00%   73.90% 


5.1 Attribute-Value 

We used the learning system CN2 [5] since it produces a decision list, which 
makes the comparison with the results provided by ALFIE easier. 

Table 1 presents the results obtained for attribute-value datasets. All of them 
can be obtained from the UCI machine learning repository, and they often appear 
in the literature as benchmarks for learning systems. We used the subset of UCI 
datasets included in the package MLC++. They have the advantage that they 
have been split 3-fold using the utility GenXVFILES, so that approximately 
2/3 of the examples are used for training and the rest are used for testing. 

In the Zoo problem, accuracy is increased because ALFIE produces conditions 
in which the value of the attribute hair must be equal to the value of 
the attribute backbone for crustaceans. This condition simply states the high 
correlation between the two attributes. Although it may seem strange, it points 
to interesting facts about the dataset that were overlooked by CN2. 

5.2 Complex Structure 

This is the kind of problem we are most interested in. We focused on two 
well known problems of molecular biology: Mutagenesis [13] and PTE [1]. We 
used the Progol system as the benchmark since the datasets we used have 
the knowledge representation ready for Progol. We performed experiments on 
three variants of Mutagenesis: plain, using structural information only, and 
Mutagenesis with regression-unfriendly data. In all cases a ten-fold cross-validation 
experiment was carried out. The results presented correspond to the average of 
the ten folds. Table 2 presents the results obtained for these datasets. 

We would like to stress the fact that we did not provide the background 
knowledge that the Progol learning system had at its disposal. Further background 
knowledge about the rings in the molecule is provided in the Progol dataset. We 
did not include such information in ALFIE because we were more interested in 
exploring what it was possible to learn just from the highly structured data. 
It is interesting to note that the algorithm “invented” negation, as in the 
property 

card (filter (\x -> (iselSCelemP x)) (atomSetP v1)) <= 0 

which indicates the absence of Sulphur atoms in a molecule. 

6 Knowledge Representation Change 

Our main motivation was to develop a framework to handle complex data in 
a straightforward manner. That is, the learning system should be 
able to cope with the different knowledge representations that the user may come 
up with. The following experiment tests such a scenario. 

The White King and White Rook versus Black King problem is a well known dataset 
in the machine learning community. It consists of examples of chess boards with 
the positions of the pieces on them. The learning task is to obtain rules that 
determine when such a board has an illegal configuration. 

Although this is an attribute-value problem, since examples can be expressed 
as tuples of six elements, our interest in it was to explore the impact on the 
learning system of a different representation. 

The problem as first stated was simply a collection of predicates stating 
whether a board was illegal 

illegal(WKingRank, WKingFile, WRookRank, WRookFile, BKingRank, BKingFile). 

This was changed to a representation that has a board as three tuples, each 
representing the position of a piece. In Prolog that was represented as 

illegal(Board). 
whiteking(Board, WKingRank, WKingFile). 
whiterook(Board, WRookRank, WRookFile). 
blackking(Board, BKingRank, BKingFile). 

Additionally, a predicate adjacent has been defined that determines 
whether a piece is next to another. In Escher the representation is 

type PosWKing = (WKingRank, WKingFile) 
type PosWRook = (WRookRank, WRookFile) 
type PosBKing = (BKingRank, BKingFile) 
type Board = (PosWKing, PosWRook, PosBKing) 

Given this representation, ALFIE's search space contains elements such as 

((adjacent (fileP (whiteKingP v1)) (fileP (blackKingP v1))) 
 && ((fileP (whiteRookP v1)) == (fileP (blackKingP v1)))) 
((adjacent (fileP (whiteKingP v1)) (fileP (blackKingP v1))) 
 && (adjacent (rankP (whiteKingP v1)) (rankP (whiteRookP v1)))) 

ALFIE tries to produce hypotheses involving the positions of the pieces straight 
away. In contrast, Progol's search space contains 
[C:-65,65,71,0 illegal(A) :- wk(A,B,B).] 
[C:5,506,499,0 illegal(A) :- bk(A,B,C).] 
[C:-35,176,185,0 illegal(A) :- wr(A,B,C), bk(A,B,D).] 
[C:59,66,55,0 illegal(A) :- wr(A,B,C), B=4.] 
[C:51,66,55,0 illegal(A) :- wr(A,B,C), B=4, adj(C,C).] 
[C:81,74,57,0 illegal(A) :- wr(A,B,C), C=6, adj(B,B), adj(C,C).] 
[C:4,506,499,0 illegal(A) :- wr(A,B,C), adj(B,B).] 
[C:41,57,49,0 illegal(A) :- bk(A,B,C), B=4.] 

It can be observed that because of the variables involved, Progol must try a 
large number of combinations. Moreover, there is no clear relation between the 
different variables and Progol attempts to find a solution. The search space is 
not bounded as in ALFIE by deconstruction. 

Solutions. The solution found by Progol with the 6-tuple representation 
contains the following clauses. 

illegal(A,B,C,D,E,F) :- adj(E,A), adj(B,F). 
illegal(A,B,C,D,C,E). 

For the structured problem the following clauses are found, among others. There 
is no clear relation between the two solutions. 

illegal(A) :- wk(A,B,C), wr(A,D,E), bk(A,F,C), adj(E,B), adj(E,F). 
illegal(A) :- wk(A,B,C), wr(A,D,E), bk(A,F,F), adj(E,D). 
illegal(A) :- wr(A,B,7), bk(A,C,D), adj(D,B). 

In the case of ALFIE we obtain for the unstructured case the first rule 

if (v2,v3,v4,v5,v6,v7) = v1 && adjacent v3 v7 then illegal 

And for the structured case the first rule 

if adjacent (fileP (whiteKingP v1)) (fileP (blackKingP v1)) 
then illegal 

Note that both conditions are the same, since v3 and v7 are the files of the 
white king and the black king, respectively. 

7 Conclusion 

A different approach to handling highly-structured examples in inductive 
learning has been presented. This approach relies on the complementary techniques 
of deconstruction and function composition. Deconstruction makes it possible to 
access the components of a structure and composition provides a powerful means 
of constructing complex conditions on both the components and the structure 
of the examples. Together, these techniques make it possible to learn from highly 
structured data. 

A learning system, ALFIE, that uses both deconstruction and function 
composition was described, together with experimental results. The results show 
that our approach learns from both simple and highly structured data. Furthermore, 
we were able to test how our approach copes with different knowledge 
representations. 
Abstract. Multiclass classification using Machine Learning techniques 
consists of inducing a function $f(\mathbf{x})$ from a training set composed of pairs 
$(\mathbf{x}_i, y_i)$ where $y_i \in \{1, 2, \ldots, k\}$. Some learning methods are originally 
binary, being able to carry out classifications only where $k = 2$. Among these one 
can mention Support Vector Machines. This paper presents a comparison 
of methods for multiclass classification using SVMs. The techniques 
investigated use strategies of dividing the multiclass problem into binary 
subproblems and can be extended to other learning techniques. Results 
indicate that the use of Directed Acyclic Graphs is an efficient approach 
in generating multiclass SVM classifiers. 



1 Introduction 

Supervised learning consists of inducing a function $f(\mathbf{x})$ from a given set of 
samples with the form $(\mathbf{x}_i, y_i)$, which accurately predicts the labels of unknown 
instances [10]. Applications where the labels $y_i$ assume k values, with $k > 2$, are 
named multiclass problems. 

Some learning techniques, like Support Vector Machines (SVMs) [3], origi- 
nally carry out binary classifications. To generalize such methods to multiclass 
problems, several strategies may be employed [1,4,8,14]. This paper presents a 
study of various approaches for multiclass classification with SVMs. Although 
the study is oriented toward SVMs, it can be applied to other binary classifiers, 
as all the strategies considered divide the problem into binary classification 
subproblems. 

A first standard method for building k-class predictors from binary ones, 
named one-against-all (1AA), consists of building k classifiers, each 
distinguishing one class from the remaining classes [3]. The label of a new sample is 
usually given by the classifier that produces the highest output. 

Another common extension to multiclass classification from binary predictions 
is known as all-against-all (AAA). In this case, given k classes, $k(k-1)/2$ 
classifiers are constructed. Each of the classifiers distinguishes one class $c_i$ from 
another class $c_j$, with $i \neq j$. A majority voting among the individual responses 
can then be employed to predict the class of a sample $\mathbf{x}$ [8]. The responses of 
the individual classifiers can also be combined by an Artificial Neural Network 
(ANN) [6], which weights the importance of the individual classifiers in the 
final prediction. Another method to combine this kind of predictor, suggested 
in [14], consists of building a Directed Acyclic Graph (DAG). Each node of the 
DAG corresponds to one binary classifier. Results indicate that the use of such 
a structure can save computational time in the prediction phase. 

On another front, Dietterich and Bakiri [4] suggested the use of 
error-correcting output codes (ECOC) for representing each class in the problem. 
Binary classifiers are trained to learn the “bits” in these codes. When a new 
pattern is submitted to this system, a code is obtained. This code is compared 
to the error-correcting ones with the Hamming distance. The new pattern is 
then assigned to the class whose error-correcting codeword presents minimum 
Hamming distance to the code predicted by the individual classifiers. 

As SVMs are large margin classifiers [15], which aim at maximizing the 
distance between the patterns and the decision frontier induced, Allwein et al. [1] 
suggested using the margin of a pattern in computing its distance to the output 
codes (loss-based ECOC). This measure has the advantage of providing a notion 
of the reliability of the predictions made by the individual SVMs. 

This paper is organized as follows: Section 2 presents the materials and me- 
thods employed in this work. It describes the datasets considered, as well as the 
learning techniques and multiclass strategies investigated. Section 3 presents the 
experiments conducted and results achieved. Section 4 concludes this paper. 

2 Materials and Methods 

This section presents the materials and methods used in this work, describing 
the datasets, learning techniques and multiclass strategies employed. 



2.1 Datasets 

The datasets used in the experiments conducted were extracted from the UCI 
benchmark database [16]. Table 1 summarizes these datasets, showing the 
numbers of instances (# Instances), of continuous and nominal attributes 
(# Attributes), of classes (# Classes), the majority error (ME) and whether there 
are missing values (MV). ME represents the proportion of examples of the class 
with most patterns in the dataset. 

Instances with missing values were removed from the “bridges” and 
“pos-operative” datasets. This procedure left 70 and 87 instances in the respective 
datasets. For the “splice” dataset, instances with attributes different from the 
base pairs Adenine, Cytosine, Guanine and Thymine were eliminated. The other 
attribute values present reflect the uncertainty inherent to DNA sequencing 
processes. This left a total of 3175 instances. 

Almost all datasets were pre-processed so that the data had zero mean and 
unit variance. The exceptions were “balance” and “splice”. In “balance”, many 
attributes became null, so the pre-processing procedure was not applied. In the 
“splice” case, a coding process suggested in the bioinformatics literature, which 
represents the attributes in a canonical format, was employed instead [13]. Thus, 
the number of attributes used in “splice” was 240. 

Table 1. Datasets summary description 

Dataset        # Instances  # Attributes (cont., nom.)  # Classes  ME     MV 
Balance        625          4 (0, 4)                    3          46.1%  no 
Bridges        108          11 (0, 11)                  6          32.9%  yes 
Glass          214          9 (9, 0)                    6          35.5%  no 
Iris           150          4 (4, 0)                    3          33.3%  no 
Pos-operative  90           8 (0, 8)                    3          71.1%  yes 
Splice         3190         60 (0, 60)                  3          50.0%  no 
Vehicle        846          18 (18, 0)                  4          25.8%  no 
Wine           178          12 (12, 0)                  3          48.0%  no 
Zoo            90           17 (2, 15)                  7          41.1%  no 



2.2 Learning Techniques 



The base learning technique employed in the experiments for comparison of 
multiclass strategies was the Support Vector Machine (SVM) [3]. Inspired by 
Statistical Learning Theory [17], this technique seeks a hyperplane 
$\mathbf{w} \cdot \mathbf{x} + b = 0$ able to separate the data with maximal margin. 

To perform this task, it solves the following optimization problem: 

$$\text{Minimize: } \|\mathbf{w}\|^2$$ 

$$\text{Restrictions: } y_i(\mathbf{w} \cdot \mathbf{x}_i + b) \geq 1$$ 

where $\mathbf{x}_i \in \mathbb{R}^m$, $y_i \in \{-1, +1\}$ and $i = 1, \ldots, n$. 

In the previous formulation, it is assumed that all samples lie at a distance of 
at least the margin value from the decision border, which means that the data have 
to be linearly separable. Since in real applications this restriction is often not 
met, slack variables are introduced [5]. These variables relax the restrictions 
imposed on the optimization problem, allowing some patterns to be within the 
margins. This is accomplished by the following optimization problem: 

$$\text{Minimize: } \|\mathbf{w}\|^2 + C \sum_{i=1}^{n} \xi_i$$ 

$$\text{Restrictions: } \xi_i \geq 0, \quad y_i(\mathbf{w} \cdot \mathbf{x}_i + b) \geq 1 - \xi_i$$ 

where $C$ is a constant that imposes a tradeoff between training error and 
generalization and the $\xi_i$ are the slack variables. 

The decision frontier obtained is given by Equation 1. 

$$f(\mathbf{x}) = \sum_{\mathbf{x}_i \in SV} y_i \alpha_i \, \mathbf{x}_i \cdot \mathbf{x} + b \qquad (1)$$ 

where the constants $\alpha_i$ are called “Lagrange multipliers” and are determined in 
the optimization process. SV corresponds to the set of support vectors, the patterns 
for which the associated Lagrange multipliers are larger than zero. These samples 
are those closest to the optimal hyperplane. For all other patterns the associated 
Lagrange multiplier is null, so they do not participate in the determination of 
the final hypothesis. 

The classifier represented in Equation 1 is still restricted by the fact that it 
performs a linear separation of data. This can be solved by mapping the data 
samples to a high-dimensional space, also named feature space, where they can be 
efficiently separated by a linear SVM. This mapping is performed with the use of 
Kernel functions, that allow the access to spaces of high dimensions without the 
need of knowing the mapping function explicitly, which usually is very complex. 
These functions compute dot products between any pair of patterns in the feature 
space. Thus, the only modification necessary to deal with non-linearity is to 
substitute any dot product among patterns by the Kernel product. 
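
As a concrete illustration of replacing the dot product by a kernel, the sketch below evaluates a kernelized version of Equation 1 with a Gaussian kernel. The function names, the toy support vectors and the parametrization $\exp(-\|\mathbf{u}-\mathbf{v}\|^2 / 2\sigma^2)$ are assumptions made for illustration; the exact form used by a given SVM package may differ.

```python
import math


def gaussian_kernel(u, v, sigma=5.0):
    """Gaussian kernel exp(-||u - v||^2 / (2 sigma^2)) (assumed parametrization)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-squared_distance / (2 * sigma ** 2))


def decision(x, support, b, kernel=gaussian_kernel):
    """Kernelized Equation 1: sum of y_i * alpha_i * K(x_i, x) over SVs, plus b.

    support is a list of (alpha_i, y_i, x_i) triples for the support vectors.
    """
    return sum(alpha * y * kernel(x_i, x) for alpha, y, x_i in support) + b


# Toy support vectors, made up for the example.
support = [(1.0, +1, [0.0, 0.0]), (1.0, -1, [10.0, 10.0])]
print(decision([1.0, 1.0], support, b=0.0))
print(decision([9.0, 9.0], support, b=0.0))
```

The sign of the returned value gives the predicted binary class; its magnitude is the output later compared across classifiers in the multiclass strategies.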

For combining the multiple binary SVMs generated in some of the 
experiments, Artificial Neural Networks (ANNs) of the Multilayer Perceptron type 
were considered. These structures are inspired by the structure and learning 
ability of a “biological brain” [6]. They are composed of one or more layers of 
artificial neurons, interconnected by weighted links. These weights 
encode the knowledge of the network. This weighted scheme may be a useful 
alternative for weighting the strength of each binary SVM in the final multiclass 
prediction, as described in the following section. 

2.3 Multiclass Strategies 

The most straightforward way to build a k-class predictor from binary 
classifiers is to generate k binary predictors. Each classifier is responsible for 
distinguishing a class $c_i$ from the remaining classes. The final prediction is given 
by the classifier with the highest output value [3]. This method is called 
one-against-all (1AA) and is illustrated in Equation 2, where $i = 1, \ldots, k$ and $\Phi$ 
represents the mapping function in non-linear SVMs. 

$$f(\mathbf{x}) = \arg\max_{i} \, (\mathbf{w}_i \cdot \Phi(\mathbf{x}) + b_i) \qquad (2)$$ 
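
The one-against-all decision rule of Equation 2 amounts to picking the index of the classifier with the largest output. A minimal sketch, assuming each binary classifier is available as a Python function returning its real-valued output (the names `one_against_all_predict` and `linear_scorer` are hypothetical, and linear scorers stand in for trained SVMs):

```python
def one_against_all_predict(scorers, x):
    """Return the index i of the classifier with the highest output on x."""
    outputs = [score(x) for score in scorers]
    return max(range(len(outputs)), key=lambda i: outputs[i])


def linear_scorer(w, b):
    """Toy stand-in for a trained binary classifier: x -> w . x + b."""
    return lambda x: sum(wi * xi for wi, xi in zip(w, x)) + b


scorers = [
    linear_scorer([1.0, 0.0], 0.0),
    linear_scorer([0.0, 1.0], 0.0),
    linear_scorer([-1.0, -1.0], 0.5),
]
predicted = one_against_all_predict(scorers, [0.2, 0.9])
print(predicted)
```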

Another standard methodology, called all-against-all (AAA), consists of 
building $k(k-1)/2$ predictors, each differentiating a pair of classes $c_i$ and $c_j$, with 
$i \neq j$. For combining these classifiers, a majority voting scheme (VAAA) can be 
applied [8]. Each AAA classifier gives one vote for its preferred class. The final 
result is given by the class with most votes. 
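
The voting scheme can be sketched as below. The dictionary `pairwise`, keyed by class pairs, and the convention that a positive output favours the first class of the pair are assumptions made for the example; ties are broken here by the order in which classes first receive a vote, which the source does not specify.

```python
from collections import Counter
from itertools import combinations


def vote_predict(pairwise, classes, x):
    """Majority voting over all k(k-1)/2 pairwise classifiers.

    pairwise[(i, j)](x) > 0 is taken to mean that class i is preferred to j.
    """
    votes = Counter()
    for i, j in combinations(classes, 2):
        winner = i if pairwise[(i, j)](x) > 0 else j
        votes[winner] += 1
    return votes.most_common(1)[0][0]


# Toy 1-D problem: class c has centre 5 * c and each pairwise classifier
# prefers the class whose centre is closer to x.
classes = [0, 1, 2]
pairwise = {
    (i, j): (lambda x, i=i, j=j: abs(x - 5 * j) - abs(x - 5 * i))
    for i, j in combinations(classes, 2)
}
print(vote_predict(pairwise, classes, 6))
```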

Platt et al. [14] point out some drawbacks in the previous strategies. The main 
problem is a lack of theory in terms of generalization bounds. To overcome 
this, they developed a method to combine the SVMs generated in the AAA 
methodology, based on the use of Directed Acyclic Graphs (DAGSVM). The 
authors provide error bounds on the generalization of this system in terms of 
the number of classes and the margin achieved by each SVM on the nodes. 
A Directed Acyclic Graph (DAG) is a graph with oriented edges and no 
cycles. The DAGSVM approach uses the SVMs generated in an AAA manner in 
each node of a DAG. Computing the prediction of a pattern using the DAGSVM 
is equivalent to operating on a list of classes. Starting from the root node, the 
sample is tested against the first and last classes of the problem, which usually 
correspond to the first and last elements of the initial list. The class with the lowest 
output in the node is then eliminated from the list, and the node equivalent to 
the new list obtained is consulted. This process proceeds until a single class 
remains. Figure 1 illustrates an example where four classes are present. For k 
classes, only $k - 1$ SVMs are evaluated at test time. Thus, this procedure speeds up the 
test phase. 
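
The list-based view of DAGSVM prediction can be sketched as follows. As in any such illustration, the pairwise classifiers are toy functions rather than trained SVMs, and the names `dag_predict` and `pairwise` as well as the sign convention (a positive output favours the first class of the pair) are assumptions.

```python
from itertools import combinations


def dag_predict(pairwise, classes, x):
    """Evaluate a DAG of pairwise classifiers by eliminating classes from a list.

    Returns the predicted class and the number of classifiers evaluated.
    """
    remaining = list(classes)
    evaluations = 0
    while len(remaining) > 1:
        first, last = remaining[0], remaining[-1]
        evaluations += 1
        if pairwise[(first, last)](x) > 0:
            remaining.pop()     # the last class loses and is eliminated
        else:
            remaining.pop(0)    # the first class loses and is eliminated
    return remaining[0], evaluations


# Toy 1-D problem with four classes: class c has centre 5 * c.
classes = [0, 1, 2, 3]
pairwise = {
    (i, j): (lambda x, i=i, j=j: abs(x - 5 * j) - abs(x - 5 * i))
    for i, j in combinations(classes, 2)
}
label, used = dag_predict(pairwise, classes, 6)
print(label, used)
```

With four classes, six pairwise classifiers exist but only three are consulted for any given pattern, which is the source of the speed-up at test time.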




Fig. 1. (a) DAGSVM of a problem with four classes; (b) illustration of the SVM 
generated for the 1 vs 4 subproblem [14] 



This paper also investigates the use of ANNs in combining the AAA predic- 
tors. The ANN can be viewed as a technique to weight the predictions made by 
each classifier. 

In an alternative multiclass strategy, Dietterich and Bakiri [4] proposed the 
use of a distributed output code to represent the k classes in the problem. For 
each class, a codeword of length $l$ is assigned. These codes are stored in a matrix 
$M \in \{-1, +1\}^{k \times l}$. The rows of this matrix represent the codewords of each class 
and the columns, the desired outputs of the $l$ binary classifiers. A new pattern $\mathbf{x}$ can 
be classified by evaluating the predictions of the $l$ classifiers, which generates a 
string $s$ of length $l$. This string is then compared to the rows of $M$. The sample 
is assigned to the class whose row is closest according to some measure, like 
the Hamming distance [4]. Commonly, the codewords have more bits 
than needed to represent each class uniquely. The additional bits can be used 
to correct eventual classification errors. For this reason, this method is named 
error-correcting output coding (ECOC). 
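
Hamming decoding can be illustrated with a short sketch. The 4 x 7 code matrix below is a made-up example in which any two rows differ in at least four positions, so one wrong bit can be corrected; the names `hamming` and `ecoc_decode` are hypothetical.

```python
def hamming(a, b):
    """Number of positions at which two equal-length codes differ."""
    return sum(u != v for u, v in zip(a, b))


def ecoc_decode(code_matrix, s):
    """Return the index of the row of code_matrix closest to s."""
    return min(range(len(code_matrix)), key=lambda r: hamming(code_matrix[r], s))


# Example code matrix M with k = 4 classes and codewords of length l = 7.
M = [
    [+1, +1, +1, +1, +1, +1, +1],
    [-1, -1, -1, -1, +1, +1, +1],
    [-1, -1, +1, +1, -1, -1, +1],
    [-1, +1, -1, +1, -1, +1, -1],
]

# Outputs of the 7 binary classifiers; the first one disagrees with row 2.
s = [+1, -1, +1, +1, -1, -1, +1]
print(ecoc_decode(M, s))
```

Here the string `s` is at Hamming distance 1 from the third codeword and at distance 3 or more from the others, so the single classifier error is corrected.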
Allwein et al. [1] point out that the use of the Hamming distance ignores the 
loss function used in training, as well as the confidences attached to the predictions 
made by each classifier. The authors claim that, in the SVM case, the use of the 
margins obtained in the classification of the patterns for computing the distance 
measure can improve the performance achieved by ECOC, resulting in the 
loss-based ECOC method (LECOC). Given a problem with k classes, let $M$ be the 
matrix of codewords of length $l$, $r$ a label and $f_i(\mathbf{x})$ the prediction made by 
the $i$-th classifier. The loss-based distance of a pattern $\mathbf{x}$ to a label $r$ is given by 
Equation 3. 

$$d_M(r, \mathbf{x}) = \sum_{i=1}^{l} \max\{(1 - M(r, i) f_i(\mathbf{x})), 0\} \qquad (3)$$ 
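
Equation 3 translates almost directly into code. In the sketch below the code matrix and the classifier outputs are made-up values, and the function names are hypothetical; the point is only to show how real-valued margins, rather than their signs, enter the distance.

```python
def loss_based_distance(codeword, outputs):
    """Equation 3: sum over classifiers of max(1 - M(r, i) * f_i(x), 0)."""
    return sum(max(1 - m * f, 0) for m, f in zip(codeword, outputs))


def lecoc_decode(code_matrix, outputs):
    """Return the index of the row with the smallest loss-based distance."""
    return min(
        range(len(code_matrix)),
        key=lambda r: loss_based_distance(code_matrix[r], outputs),
    )


# Example code matrix with k = 4 classes and codewords of length l = 7.
M = [
    [+1, +1, +1, +1, +1, +1, +1],
    [-1, -1, -1, -1, +1, +1, +1],
    [-1, -1, +1, +1, -1, -1, +1],
    [-1, +1, -1, +1, -1, +1, -1],
]

# Real-valued outputs f_i(x) of the 7 binary classifiers (illustrative values).
outputs = [-0.2, -1.5, 1.2, 0.9, -1.1, -0.8, 1.3]
print(lecoc_decode(M, outputs), loss_based_distance(M[2], outputs))
```

A prediction that agrees with the codeword bit with a margin of at least 1 contributes nothing to the distance, while low-confidence or wrong predictions are penalized in proportion to how far they fall short.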

The next section presents the experiments conducted using each of the described 
strategies for multiclass classification. 

3 Experiments 

To obtain better estimates of the generalization performance of the multiclass 
methods investigated, the datasets described in Section 2.1 were first divided 
into training and test sets following the 10-fold cross-validation methodology [10]. 
According to this method, the dataset is divided into ten disjoint subsets of 
approximately equal size. In each train/test round, nine subsets are used for training 
and the remaining one is left for testing. This makes a total of ten pairs of training 
and test sets. The error of a classifier on the total dataset is then given by the 
average of the errors observed in each test partition. 
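
The partitioning step can be sketched as below. This is a generic k-fold split over example indices, not the tool used by the authors; the function name `k_fold_splits`, the shuffling and the fixed seed are assumptions made for illustration.

```python
import random


def k_fold_splits(n, k=10, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation over n examples."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    # Striding distributes the examples so fold sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, test


splits = list(k_fold_splits(25))
for train, test in splits[:2]:
    print(len(train), len(test))
```

The reported error would then be the mean of the ten test-partition errors.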

For ANNs, the training sets obtained were further subdivided into training 
and validation subsets, in a proportion of 75% and 25%, respectively. While 
the training set was applied in the determination of the network weights, the 
validation set was employed in the evaluation of the generalization capacity of 
the ANN on new patterns during its training. The network training was stopped 
when the validation error started to increase, in a strategy commonly referred 
to as early-stopping [6]. With this procedure, overfitting to the training data can be 
reduced. The validation set was also employed in the determination of the best 
network architecture. Several networks with different architectures were 
generated for each problem, and the one with the lowest validation error was chosen as 
the final ANN classifier. In this work, the architectures tested were fully connected 
one-hidden-layer ANNs with 1, 5, 10, 15, 20, 25 and 30 neurons in 
the hidden layer. The standard back-propagation algorithm was employed for 
training with a learning rate of 0.2, and the SNNS (Stuttgart Neural Network 
Simulator) [19] simulator was used to generate the networks. 

The software applied in SVMs induction was the SVMTorch II tool [2] . In all 
experiments conducted, a Gaussian Kernel with standard deviation equal to 5 
was used. The parameter C was kept equal to 100, default value of SVMTorch II. 
Although the best values for the SVM parameters may differ for each multiclass 
strategy, they were kept the same to allow a fair evaluation of the differences 
between the techniques considered. 
The codewords used in the ECOC and LECOC strategies were obtained 
following a heuristic proposed in [4]. Given a problem with $3 \leq k \leq 7$ classes, k 
codewords of length $2^{k-1} - 1$ are constructed. The codeword of the first class is 
composed only of ones. For the other classes $c_i$, where $i > 1$, it is composed of 
alternate runs of zeros and ones. 
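
A sketch of this construction is given below. The text does not state the run lengths; the sketch assumes that the codeword of class $i$ alternates runs of $2^{k-i}$ zeros and ones, truncated to the codeword length, which is how the exhaustive codes of [4] are usually described. The function name `exhaustive_code` is hypothetical, and zeros would be mapped to $-1$ to obtain a matrix over $\{-1, +1\}$.

```python
def exhaustive_code(k):
    """Build k codewords of length 2**(k-1) - 1 (assumed run-length rule).

    Row 1 is all ones; row i > 1 alternates runs of 2**(k-i) zeros and ones.
    """
    length = 2 ** (k - 1) - 1
    rows = [[1] * length]
    for i in range(2, k + 1):
        run = 2 ** (k - i)
        # Position pos falls in run number pos // run; even runs are zeros.
        rows.append([(pos // run) % 2 for pos in range(length)])
    return rows


for row in exhaustive_code(4):
    print("".join(str(bit) for bit in row))
```

For four classes this yields the codewords 1111111, 0000111, 0011001 and 0101010, so seven binary classifiers are trained, one per column.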

Next, Section 3.1 summarizes the results observed and Section 3.2 
discusses the work conducted. 

3.1 Results 

Table 2 presents the accuracy (percent of correct classifications) achieved by the 
multiclass strategies investigated. The first and second best accuracies obtained 
in each dataset are indicated in boldface and italic, respectively. 



Table 2. Multiclass strategies accuracies 

Dataset   1AA         VAAA        AAA-ANN     DAGSVM      ECOC        LECOC 
Balance   96.5±3.1    97.9±2.0    99.2±0.8    98.4±1.8    90.7±4.7    96.5±3.1 
Bridges   58.6±19.6   60.0±17.6   58.6±12.5   62.9±16.8   58.6±20.7   61.4±19.2 
Glass     64.9±14.2   67.3±10.3   68.2±9.7    68.7±11.9   65.3±14.5   65.8±14.6 
Iris      95.3±4.5    96.0±4.7    96.0±4.7    96.0±4.7    94.7±6.1    95.3±4.5 
Pos op.   63.5±18.7   64.6±19.3   62.2±17.1   61.3±26.7   61.3±18.9   63.5±18.7 
Splice    96.8±0.9    96.8±0.7    83.4±1.7    96.8±0.7    93.6±1.7    96.8±0.9 
Vehicle   85.4±4.5    85.5±4.0    84.4±5.3    85.8±4.0    81.9±4.7    85.5±3.6 
Wine      98.3±2.7    98.3±2.7    97.2±3.9    98.3±2.7    97.2±4.1    98.3±2.7 
Zoo       95.6±5.7    94.4±5.9    94.4±5.9    94.4±5.9    95.6±5.7    95.6±5.7 


Similarly to Table 2, Table 3 presents the mean time spent on training, in 
seconds. All experiments were carried out on a dual Pentium II processor with 
330 MHz and 128 MB of RAM memory. 

Table 4 shows the medium number of support vectors of the models (j) SVs). 
This value is related to the processing time required to classify a given pattern. 
A smaller number of SVs leads to faster predictions [9]. 

In the case of the AAA combination with ANNs, other question to be conside- 
red in terms of classification speed is the network architecture. Larger networks 
lead to slower classification speeds. Table 5 shows, for each dataset, the number 
of hidden neurons of the best ANN architectures obtained in each dataset. 

3.2 Discussion 

According to Table 2, the accuracy rates achieved by the different multiclass tech- 
niques are not much different. Applying the corrected resampled t-test statistic 
described in [12] to the first and second best results in each dataset, no statistical 
significance can be detected at 95% of confidence level. However, the results sug- 
gests that the most successful strategy is the DAGSVM. On the other side, the 

Table 3. Training time (seconds) 

Dataset   1AA          VAAA        AAA-ANN     DAGSVM      ECOC         LECOC
Balance   5.7±1.1      4.2±1.1     33.7±1.0    4.2±1.1     4.3±0.5      4.3±0.5
Bridges   11.3±1.1     43.4±1.9    50.3±10.8   43.4±1.9    64.3±3.4     64.3±3.4
Glass     10.2±0.8     38.7±2.0    42.1±3.2    38.7±2.0    72.5±2.0     72.5±2.0
Iris      1.7±1.2      1.9±1.0     11.7±7.0    1.9±1.0     1.8±1.2      1.8±1.2
Pos op.   3.6±0.8      4.4±0.7     7.3±1.2     4.4±0.7     3.7±0.7      3.7±0.7
Splice    476.7±17.1   205.9±1.2   388.2±4.2   205.9±1.2   497.4±41.5   497.4±41.5
Vehicle   21.9±0.9     17.4±1.1    96.7±1.3    17.4±1.1    45.2±1.3     45.2±1.3
Wine      4.9±0.6      4.7±0.7     13.5±19.0   4.7±0.7     5.0±0.0      5.0±0.0
Zoo       12.3±1.2     36.8±3.5    50.3±2.1    36.8±1.2    108.6±1.9    108.6±1.9

Table 4. Mean number of Support Vectors (SVs) 

Dataset   1AA           VAAA          AAA-ANN       DAGSVM       ECOC          LECOC
Balance   208.2±10.7    115.3±6.5     115.3±6.5     69.8±3.5     208.2±10.7    208.2±10.7
Bridges   129.6±4.1     175.3±3.4     175.3±3.4     59.3±2.7     879.4±29.7    879.4±29.7
Glass     275.5±7.5     252.2±6.2     252.2±6.2     100.6±5.3    2160.4±67.9   2160.4±67.9
Iris      40.8±2.7      24.5±1.4      24.5±1.4      16.7±1.3     40.7±2.8      40.7±2.8
Pos op.   106.8±7.6     61.0±6.1      61.0±6.1      54.2±4.9     107.3±7.5     107.3±7.5
Splice    5043.0±16.4   3625.2±10.3   3625.2±10.3   2577.1±9.0   5043.0±16.4   5043.0±16.4
Vehicle   669.?±?.1     469.2±5.0     469.2±5.0     232.1±9.7    1339.5±22.7   1339.5±22.7
Wine      75.2±1.5      54.3±3.4      54.3±3.4      35.3±3.2     75.2±1.5      75.2±1.5
Zoo       132.9.6±4.5   191.2±6.8     191.2±6.8     62.0±4.8     1608.4±65.8   1608.4±65.8

Table 5. Number of hidden neurons in the AAA-ANN architectures 

Balance   Bridges   Glass   Iris   Pos op.   Splice   Vehicle   Wine   Zoo
1         5         5       10     1         1        5         30     5

ECOC method presents, in general, the lowest performance. It must be observed 
that the simple modification of this technique with the use of a distance measure 
based on margins (LECOC) improves its results substantially. Comparing the 
methods with the best and worst accuracy in each dataset, a statistically 
significant difference at the 95% confidence level can be verified in the 
following datasets: “balance”, “splice” and “vehicle”. 
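
As an illustration of the significance test mentioned above, the sketch below computes a commonly used form of the corrected resampled t statistic, in which the variance term 1/k is inflated by the ratio of test to training set sizes. That this is exactly the statistic described in [12] is an assumption, and the accuracy differences used here are made-up numbers, not results from Table 2.

```python
import math
from statistics import mean, variance

def corrected_resampled_t(diffs, n_train, n_test):
    """Corrected resampled t statistic over k paired accuracy differences.

    The usual 1/k factor is replaced by (1/k + n_test/n_train) to account
    for the overlap between the training sets of different resamplings.
    """
    k = len(diffs)
    return mean(diffs) / math.sqrt((1.0 / k + n_test / n_train) * variance(diffs))

def standard_t(diffs):
    """Uncorrected paired t statistic, shown only for comparison."""
    return mean(diffs) / math.sqrt(variance(diffs) / len(diffs))

# Hypothetical per-run accuracy differences between two methods.
diffs = [0.02, 0.01, 0.03, 0.00, 0.02]
t_corrected = corrected_resampled_t(diffs, n_train=90, n_test=10)
t_standard = standard_t(diffs)
print(round(t_corrected, 3), round(t_standard, 3))
```

The statistic is compared against a Student t distribution with k - 1 degrees of freedom; the correction always yields a smaller value than the uncorrected test, making it more conservative.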

Comparing the three methods for AAA combination, no statistical significance 
at the 95% confidence level can be verified in terms of accuracy, except on 
the “splice” dataset, where the ANN integration was worse. It should be noticed 
that the ANN approach tends, in some datasets, to lower the standard deviation 
of the accuracies obtained, indicating some stability gain. 

Concerning training time, in general the fastest methodology was 1AA. The 
ECOC and LECOC approaches, on the other hand, were generally slower in 
this phase. The lower training time achieved by VAAA and DAGSVM in some 
datasets is due to the fact that these methods train each SVM on smaller subsets 
of data, which speeds them up. In the AAA-ANN case, the ANN training time 
has to be taken into account, which gives a larger time than those of VAAA or 
DAGSVM. 

From Table 4 it can be observed that the DAGSVM method had the lowest 
number of SVs in all cases. This means that the DAG strategy speeds up the 
classification of new samples. VAAA figures as the method with the second lowest 
number of SVs. Again, the simpler data samples used in the induction of the 
binary classifiers in this case can be the cause of this result. It should be 
noticed that, although AAA-ANN has the same number of SVs as VAAA, the ANN 
prediction stage also has to be considered. 
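
The speed advantage of the DAG strategy can be sketched as follows: for k classes, one common way to evaluate a decision DAG keeps a list of candidate classes and discards one per pairwise decision, so only k - 1 binary classifiers are consulted. The `nearest_class` function below is a hypothetical stand-in for trained pairwise SVMs, used only to make the example runnable.

```python
def dag_predict(classes, pairwise, x):
    """Evaluate a decision DAG over pairwise classifiers.

    pairwise(a, b, x) must return either a or b. Each decision removes
    the losing class, so k classes need exactly k - 1 evaluations.
    """
    candidates = list(classes)
    evaluations = 0
    while len(candidates) > 1:
        a, b = candidates[0], candidates[-1]
        winner = pairwise(a, b, x)
        evaluations += 1
        if winner == a:
            candidates.pop()       # class b is eliminated
        else:
            candidates.pop(0)      # class a is eliminated
    return candidates[0], evaluations

def nearest_class(a, b, x):
    """Toy pairwise rule (assumption): pick the class label closest to x."""
    return a if abs(a - x) <= abs(b - x) else b

print(dag_predict([0, 1, 2, 3], nearest_class, 2.2))
```

With four classes, a voting AAA scheme would consult all six pairwise classifiers, whereas the DAG above consults three.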

4 Conclusion 

This work evaluated several techniques for multiclass classification with SVMs, 
originally binary predictors. There are currently works generalizing SVMs to the 
multiclass case directly [7,18]. However, the focus of this work was on the use 
of SVMs as binary predictors, and the methods presented can be extended to 
other Machine Learning techniques. 

Although some differences were observed among the methods in terms of 
performance, in general no technique can be considered the most suited for a given 
application. When the main requirement is classification speed, while maintaining 
a good accuracy, the results observed indicate that an efficient alternative 
for SVMs is the use of the DAG approach. 

As future research, further experiments should be conducted to tune the 
parameters of the SVMs (Gaussian Kernel standard deviation and the value of 
C). This could improve the results obtained in each dataset. 

Abstract. A novel approach to feature selection is presented in this paper, in 
which the aim is to visualize and extract information from complex, high 
dimensional spectroscopic data. The model proposed is a mixture of factor 
analysis and exploratory projection pursuit based on a family of cost functions 
proposed by Fyfe and MacDonald [12] which maximizes the likelihood of 
identifying a specific distribution in the data while minimizing the effect of 
outliers [9,12]. It employs cooperative lateral connections derived from the 
Rectified Gaussian Distribution [8,14] to enforce a more sparse representation 
in each weight vector. We also demonstrate a hierarchical extension to this 
method which provides an interactive method for identifying possibly hidden 
structure in the dataset. 



1 Introduction 

We introduce a method which is closely related to factor analysis and exploratory 
projection pursuit. It is a neural model based on the Negative Feedback artificial 
neural network, which has been extended by the combination of two different 
techniques. The first is the selection of a cost function from a family of cost 
functions which identify different distributions; this method is called 
Maximum-Likelihood Hebbian learning [6]. The second is the addition of cooperative 
lateral connections derived from the Rectified Gaussian Distribution [8] to the 
Maximum-Likelihood method by Corchado et al. [14], which enforced a greater 
sparsity in the weight vectors. 

In this paper we provide a hierarchical extension to the Maximum-Likelihood 
method. 



2 The Negative Feedback Neural Network 

The Negative Feedback Network [4,5] is the basis of the Maximum-Likelihood 
model. Consider an N-dimensional input vector, x, and an M-dimensional output 
vector, y, with $W_{ij}$ being the weight linking input j to output i and let 
$\eta$ be the learning rate. 



R. Monroy et al. (Eds.): MICAI 2004, LNAI 2972, pp. 282-291, 2004. 
© Springer-Verlag Berlin Heidelberg 2004 







The initial situation is that there is no activation at all in the network. The input 
data is fed forward via weights from the input neurons (the x-values) to the output 
neurons (the y-values) where a linear summation is performed to give the activation 
of the output neuron. We can express this as: 



$$y_i = \sum_{j=1}^{N} W_{ij} x_j, \quad \forall i \qquad (1)$$



The activation is fed back through the same weights and subtracted from the inputs 
(where the inhibition takes place): 






$$e_j = x_j - \sum_{i=1}^{M} W_{ij} y_i \qquad (2)$$



After that simple Hebbian learning is performed between input and outputs: 



$$\Delta W_{ij} = \eta\, e_j\, y_i \qquad (3)$$



Note that this algorithm is clearly equivalent to Oja’s Subspace Algorithm [7] since 
if we substitute Equation 2 in Equation 3 we get: 



$$\Delta W_{ij} = \eta \left( x_j - \sum_{k} W_{kj} y_k \right) y_i \qquad (4)$$

This network is capable of finding the principal components of the input data [4] in 
a manner that is equivalent to Oja’s Subspace algorithm [7], and so the weights will 
not find the actual Principal Components but a basis of the Subspace spanned by these 
components. 
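
The feedforward, feedback and Hebbian steps described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the function name and the plain list-of-lists weight layout, with `W[i][j]` linking input j to output i, are choices made here and are not taken from the original implementation.

```python
def negative_feedback_step(W, x, eta):
    """One presentation of input x to the Negative Feedback Network.

    W is an M x N list of lists and is updated in place.
    Returns the outputs y and the residual e.
    """
    M, N = len(W), len(x)
    # Feedforward: linear summation at each output neuron.
    y = [sum(W[i][j] * x[j] for j in range(N)) for i in range(M)]
    # Feedback: activation returned through the same weights and subtracted.
    e = [x[j] - sum(W[i][j] * y[i] for i in range(M)) for j in range(N)]
    # Simple Hebbian learning between the residual and the outputs.
    for i in range(M):
        for j in range(N):
            W[i][j] += eta * e[j] * y[i]
    return y, e

W = [[1.0, 0.0]]                      # one output, two inputs
y, e = negative_feedback_step(W, [2.0, 1.0], eta=0.1)
print(y, e, W)
```

In this toy case the first input is reconstructed exactly, so only the weight attached to the second input changes.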

Factor Analysis is a technique similar to PCA in that it attempts to explain the data 
set in terms of a smaller number of underlying factors. However, Factor Analysis 
begins with a specific model and then attempts to explain the data by finding 
parameters which best fit this model to the data. Charles [2] has linked a constrained 
version of the Negative Feedback network to Factor Analysis. The constraint put on 
the network was a rectification of either the weights or the outputs (or both). Thus if 
the weight update resulted in negative weights, those weights were set to zero; if the 
feedforward mechanism gave a negative output, this was set to zero. We will use the 
notation $[t]^+$ for this rectification: if t<0, t is set to 0; if t>0, t is unchanged. 



3 ε-Insensitive Hebbian Learning 

It has been shown [10] that the nonlinear PCA rule 

$$\Delta W_{ij} = \eta \left( x_j f(y_i) - f(y_i) \sum_{k} W_{kj} f(y_k) \right) \qquad (5)$$

can be derived as an approximation to the best non-linear compression of the data. 
Thus we may start with a cost function 

$$J(W) = 1^T E\left\{ \left( x - W f(W^T x) \right)^2 \right\} \qquad (6)$$

which we minimise to get the rule (5). [12] used the residual in the linear version of 
(6) to define a cost function of the residual 

$$J = f_1(e) = f_1(x - Wy) \qquad (7)$$

where $f_1 = \|\cdot\|^2$ is the (squared) Euclidean norm in the standard linear or 
nonlinear PCA rule. With this choice of $f_1(\cdot)$, the cost function is minimized 
with respect to any set of samples from the data set on the assumption that the 
residuals are chosen independently and identically distributed from a standard 
Gaussian distribution [15]. 

We may show that the minimization of J is equivalent to minimizing the negative 
log probability of the residual, e, if e is Gaussian. Let: 

$$p(e) = \frac{1}{Z} \exp(-e^2) \qquad (8)$$

The factor Z normalizes the integral of $p(e)$ to unity. 

Then we can denote a general cost function associated with this network as 

$$J = -\log p(e) = e^2 + K \qquad (9)$$

where K is a constant. Therefore performing gradient descent on J we have 

$$\Delta W \propto -\frac{\partial J}{\partial W} = -\frac{\partial J}{\partial e} \frac{\partial e}{\partial W} \approx y (2e)^T \qquad (10)$$

where we have discarded a less important term (see [11] for details). 

In general [9], the minimisation of such a cost function may be thought of as 
increasing the probability of the residuals, in a way that depends on the 
probability density function (pdf) of the residuals. Thus if the probability 
density function of the residuals is known, this knowledge could be used to 
determine the optimal cost function. 

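[12] investigated this with the (one dimensional) function: 

$$p(e) = \frac{1}{2 + \varepsilon} \exp\left( -|e|_{\varepsilon} \right) \qquad (11)$$

where 

$$|e|_{\varepsilon} = \begin{cases} 0 & \forall\, |e| < \varepsilon \\ |e| - \varepsilon & \text{otherwise} \end{cases} \qquad (12)$$

with $\varepsilon$ being a small scalar > 0. 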

[12] described this in terms of noise in the data set. However we feel that it is more 
appropriate to state that, with this model of the pdf of the residual, the optimal 
$f_1(\cdot)$ function is the ε-insensitive cost function: 

$$f_1(e) = |e|_{\varepsilon} \qquad (13)$$
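
In the case of the Negative Feedback Network, the learning rule is 

$$\Delta W \propto -\frac{\partial J}{\partial W} = -\frac{\partial f_1(e)}{\partial e} \frac{\partial e}{\partial W} \qquad (14)$$

which gives: 

$$\Delta W_{ij} = \begin{cases} 0 & \text{if } |e_j| < \varepsilon \\ \eta\, y_i\, \mathrm{sign}(e_j) & \text{otherwise} \end{cases} \qquad (15)$$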



The difference with the common Hebb learning rule is that the sign of the residual 
is used instead of the value of the residual. Because this learning rule is insensitive to 
the magnitude of the input vectors x, the rule is less sensitive to outliers than the usual 
rule based on mean squared error. 
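
The robustness to outliers can be seen in a small sketch of the ε-insensitive weight update: residuals inside the insensitive zone cause no change, and every residual outside it contributes a step of the same size regardless of its magnitude. The function names and the list-of-lists weight layout are assumptions made for this illustration.

```python
def sign(v):
    """Return -1, 0 or 1 according to the sign of v."""
    return (v > 0) - (v < 0)

def eps_insensitive_update(W, y, e, eta, eps):
    """Apply the epsilon-insensitive rule in place to W (M x N).

    W[i][j] is left unchanged when |e_j| < eps; otherwise it moves by
    eta * y_i * sign(e_j), independent of how large e_j is.
    """
    for i in range(len(W)):
        for j in range(len(e)):
            if abs(e[j]) >= eps:
                W[i][j] += eta * y[i] * sign(e[j])
    return W

W = [[0.0, 0.0, 0.0]]
# Residuals: one inside the zone, one moderate, one extreme outlier.
eps_insensitive_update(W, y=[2.0], e=[0.05, -3.0, 100.0], eta=0.1, eps=0.1)
print(W)
```

The residual of 100 moves its weight by the same amount as the residual of -3, whereas a rule proportional to the residual would move it more than thirty times as far.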



4 Maximum Likelihood Hebbian Learning 



Now the ε-insensitive learning rule is clearly only one of a possible family of learning 
rules which are suggested by the family of exponential distributions. Let the residual 
after feedback have probability density function 

$$p(e) = \frac{1}{Z} \exp\left( -|e|^p \right) \qquad (16)$$

Then we can denote a general cost function associated with this network as 

$$J = -\log p(e) = |e|^p + K \qquad (17)$$

where K is a constant. Therefore performing gradient descent on J we have 

$$\Delta W \propto -\frac{\partial J}{\partial W} = -\frac{\partial J}{\partial e} \frac{\partial e}{\partial W} \approx y \left( p\, |e|^{p-1}\, \mathrm{sign}(e) \right)^T \qquad (18)$$



where T denotes the transpose of a vector. We would expect that for leptokurtotic 
residuals (more kurtotic than a Gaussian distribution), values of p<2 would be 
appropriate, while for platykurtotic residuals (less kurtotic than a Gaussian), values of 
p>2 would be appropriate. Therefore the network operation is: 



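Feedforward: 

$$y_i = \sum_{j=1}^{N} W_{ij} x_j, \quad \forall i \qquad (19)$$

Feedback: 

$$e_j = x_j - \sum_{i=1}^{M} W_{ij} y_i \qquad (20)$$

Weight change: 

$$\Delta W_{ij} = \eta\, y_i\, \mathrm{sign}(e_j)\, |e_j|^{p-1} \qquad (21)$$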
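
The three steps of the network operation can be sketched as a single function. This is an illustrative sketch only: the function name, the in-place list-of-lists weights and the choice to skip zero residuals (to avoid raising zero to a negative power when p < 1) are assumptions, not details given in the text.

```python
def sign(v):
    """Return -1, 0 or 1 according to the sign of v."""
    return (v > 0) - (v < 0)

def ml_hebbian_step(W, x, eta, p):
    """One feedforward, feedback and weight-change step for exponent p."""
    M, N = len(W), len(x)
    y = [sum(W[i][j] * x[j] for j in range(N)) for i in range(M)]          # feedforward
    e = [x[j] - sum(W[i][j] * y[i] for i in range(M)) for j in range(N)]   # feedback
    for i in range(M):
        for j in range(N):
            if e[j] != 0.0:  # a zero residual contributes no update
                W[i][j] += eta * y[i] * sign(e[j]) * abs(e[j]) ** (p - 1)
    return y, e

W_p2 = [[1.0, 0.0]]
W_p1 = [[1.0, 0.0]]
ml_hebbian_step(W_p2, [2.0, 3.0], eta=0.1, p=2)   # update proportional to e
ml_hebbian_step(W_p1, [2.0, 3.0], eta=0.1, p=1)   # update uses only sign(e)
print(W_p2, W_p1)
```

With p = 2 the update reduces to the ordinary Hebbian rule on the residual, while p = 1 uses only the sign of the residual, which damps the influence of large residuals.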



[12] described their rule as performing a type of PCA, but this is not strictly true since 
only the original (Oja) ordinary Hebbian rule actually performs PCA. It might be 
more appropriate to link this family of learning rules to Principal Factor Analysis 
since this method makes an assumption about the noise in a data set and then removes 
the assumed noise from the covariance structure of the data before performing a PCA. 
We are doing something similar here in that we are basing our PCA-type rule on the 
assumed distribution of the residual. By maximising the likelihood of the residual 
with respect to the actual distribution, we are matching the learning rule to the pdf of 
the residual. This method has been linked to the standard statistical method of 
Exploratory Projection Pursuit (EPP) [4,13,16]. EPP also gives a linear projection of a 
data set but chooses to project the data onto a set of basis vectors which best reveal 
the interesting structure in the data. 



5 The Rectified Gaussian Distribution 

5.1 Introduction 

The Rectified Gaussian Distribution [8] is a modification of the standard Gaussian 
distribution in which the variables are constrained to be non-negative, enabling the 
use of non-convex energy functions. 

The multivariate normal distribution can be defined in terms of an energy or cost 
function in that, if realised samples are taken far from the distribution’s mean, they 
will be deemed to have high energy and this will be equated to low probability. More 
formally, we may define the standard Gaussian distribution by: 

$$p(y) = Z^{-1} e^{-\beta E(y)} \qquad (22)$$

$$E(y) = \frac{1}{2} y^T A y - b^T y \qquad (23)$$

The quadratic energy function E(y) is defined by the vector b and the symmetric 
matrix A. The parameter $\beta = 1/T$ is an inverse temperature. Lowering the 
temperature concentrates the distribution at the minimum of the energy function. 



5.2 The Energy Function and the Cooperative Distribution 

The quadratic energy function E(y) can have different types of curvature depending 
on the matrix A. Consider the situation in which the distribution of the firing of the 
outputs of our neural network follows a Rectified Gaussian Distribution. 

Two examples of the Rectified Gaussian Distribution are the competitive and the 
cooperative distributions. The modes of the competitive distribution are well- 
separated by regions of low probability. The modes of the cooperative distribution are 
closely spaced along a non-linear continuous manifold. Our experiments focus on a 
network based on the use of the cooperative distribution. 

Neither distribution can be accurately approximated by a single standard Gaussian. 
Using the Rectified Gaussian, it is possible to represent both discrete and continuous 
variability in a way that a standard Gaussian cannot. 

The sorts of energy function that can be used are only those where the matrix A has 
the property: 

$$y^T A y > 0 \quad \text{for all } y : y_i > 0,\ i = 1 \ldots N \qquad (24)$$

where N is the dimensionality of y. This condition is called co-positivity. This 
property blocks the directions in which the energy diverges to negative infinity. 

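The cooperative distribution in the case of N variables is defined by: 

$$A_{ij} = \delta_{ij} + \frac{1}{N} - \frac{4}{N} \cos\left( \frac{2\pi}{N} (i - j) \right) \qquad (25)$$

$$b_i = 1 \qquad (26)$$

where $\delta_{ij}$ is the Kronecker delta and i and j represent the identifiers of 
output neurons. To speed learning up, the matrix A can be simplified [3] to: 

$$A_{ij} = \left( \delta_{ij} - \cos(2\pi (i - j) / N) \right) \qquad (27)$$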

and is shown diagrammatically in Figure 1. The matrix A is used to modify the 
response to the data based on the relation between the distances between the outputs. 
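
As a rough illustration, the simplified matrix A can be built directly from its definition, with each entry being the Kronecker delta minus cos(2π(i - j)/N). The function name is an assumption; the reading of the signs in the comments follows from the gradient step b - Ay and is offered as an interpretation, not as a statement from the source.

```python
import math

def cooperative_matrix(n):
    """Build the simplified N x N matrix: delta_ij - cos(2*pi*(i - j)/N)."""
    return [
        [(1.0 if i == j else 0.0) - math.cos(2.0 * math.pi * (i - j) / n)
         for j in range(n)]
        for i in range(n)
    ]

A = cooperative_matrix(8)
# Neighbouring outputs get negative entries; under a step proportional to
# b - A*y they therefore reinforce each other, while outputs half a ring
# apart get positive entries and suppress each other.
print(round(A[0][1], 3), round(A[0][4], 3))
```

The entries depend only on the circular distance between output indices, which is what allows A to modify the response according to the distance between outputs.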



5.3 Mode-Finding 

Note that the modes of the Rectified Gaussian are the minima of the energy function, 
subject to non-negativity constraints. However we will use what is probably the 
simplest algorithm, the projected gradient method, consisting of a gradient step 
followed by a rectification: 



$$y_i(t+1) = \left[ y_i(t) + \tau (b - Ay) \right]^+ \qquad (28)$$

where the rectification $[\,\cdot\,]^+$ is necessary to ensure that the y-values keep to 
the positive quadrant. If the step size $\tau$ is chosen correctly, this algorithm can 
provably be shown to converge to a stationary point of the energy function [1]. In 
practice, this stationary point is generally a local minimum. 
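
The projected gradient method can be sketched as a gradient step followed by clipping at zero. The matrix, vector, step size and iteration count below are toy values chosen for this illustration so that the unconstrained minimum would have a negative component; they are not taken from the experiments.

```python
def mode_finding(A, b, y, tau, steps):
    """Iterate y <- [y + tau * (b - A y)]^+ for a fixed number of steps."""
    n = len(y)
    for _ in range(steps):
        Ay = [sum(A[i][k] * y[k] for k in range(n)) for i in range(n)]
        # Gradient step on the energy, then rectification at zero.
        y = [max(0.0, y[i] + tau * (b[i] - Ay[i])) for i in range(n)]
    return y

A = [[2.0, 0.0], [0.0, 1.0]]
b = [2.0, -1.0]            # the unconstrained minimum would be (1, -1)
y = mode_finding(A, b, [0.0, 0.5], tau=0.1, steps=500)
print(y)
```

The second component is driven towards a negative value and is held at zero by the rectification, while the first converges to its unconstrained optimum.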

The mode of the distribution can be approached by gradient descent on the 
energy function. Its negative derivative with respect to y is: 

$$\Delta y \propto -\frac{\partial E}{\partial y} = -(Ay - b) = b - Ay \qquad (29)$$

which is used as in Equation 28. 

Now the rectification in Equation 28 is identical to the rectification which 
Corchado [14] used in the Maximum-Likelihood Network. 

We use the standard Maximum-Likelihood Network but now with a lateral 
connection (which acts after the feed forward but before the feedback). Thus we have 



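Feedforward: 

$$y_i = \sum_{j=1}^{N} W_{ij} x_j, \quad \forall i \qquad (30)$$

Lateral activation passing: 

$$y_i(t+1) = \left[ y_i(t) + \tau (b - Ay) \right]^+ \qquad (31)$$

Feedback: 

$$e_j = x_j - \sum_{i=1}^{M} W_{ij} y_i \qquad (32)$$

Weight change: 

$$\Delta W_{ij} = \eta\, y_i\, \mathrm{sign}(e_j)\, |e_j|^{p-1} \qquad (33)$$

where the parameter $\tau$ represents the strength of the lateral connections. 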



6 Generating a Reduced Scatterplot Matrix Using Cooperative 
Maximum Likelihood Learning 

When researchers initially investigated spectroscopic data they looked for structure by 
generating a scatterplot matrix in which they plotted each dimension of the data 
against one another. This technique rapidly became less viable as the dimensionality 
of the data increased. 

Investigators later used techniques such as PCA to provide a single projection 
which tried to provide as much information as possible. In our method, unlike PCA, 
we have no ordering in our projections, but we reduce the number of factors back 
down to a manageable level where we can generate this scatterplot matrix and look 
for the structure by eye. As ML looks for correlations or clusters, it will generate 
factors which are a linear combination of the data, resulting in fewer factors than the 
dimensions of the data. 



7 Hierarchical Cooperative Maximum Likelihood Method 
(HCML) 

ML and FA can only provide a linear projection of the data set. There may be cases 
where the structure of the data may not be captured by a single linear projection. In 
such cases a hierarchical scheme may be beneficial. In this method we project the 
data, and then perform brushing to select data within regions of interest which we 
then re-project. 

This can be done in two ways. The first is to project the data using the ML method, 
select the data points which are interesting and re-run the ML network on the selected 
data. Using this method only the projections are hierarchical. 

A second, more interesting adaptation is to use the resulting projected data of the 
previous ML network as the input to the next layer. Each subsequent layer of the 
network then identifies structure among fewer data points in a lower dimensional 
subspace. 

The Hierarchical Cooperative Maximum Likelihood method can reveal structure in 
the data which would not be identified by a single Maximum Likelihood projection as 
each subsequent projection can analyse different sections of the subspace spanned by 
the data. 


8 The Spectroscopic Stained Glass Data 

The data used to illustrate our method is composed of samples from 76 different 
sections of the window. The window has 6 colours, which are green, red, blue, yellow, 
pink and white. After a morphological study the red stained glass sections were 
found to consist of two layers, one transparent and the other coloured. 
This resulted in the re-sampling of the red glass as two separate samples, one red and 
the other transparent. After this the data contained 450 data vectors obtained from 90 
samples, each having been analysed 5 times. The data has 1020 dimensions, which after 
normalisation was reduced to 390 dimensions. 



9 Results and Conclusions 

In this section we show the results obtained on the spectroscopic data and highlight 
the differences in the projections obtained by PCA and ML. We also demonstrate the 
HCML Method. 

Comparison of PCA and ML on the Spectroscopic Data 

In Figure 1 we show the comparison of PCA and ML projections of the spectroscopic 
data on the first 4 eigenvectors/factors respectively. ML (Figure 1a) clearly shows 
more structure and greater separation between clusters than is achieved with PCA 
(Figure 1b). 

Fig. 1. ML and PCA on spectroscopic data: (a) ML on spectroscopic data; (b) PCA on 
spectroscopic data. 

Figure 2 shows a comparison of the first 2 eigenvector pairs (Figures 2b, 2d) 
against the first two ML factor pairs (Figures 2a, 2c). In Figure 2a the projection is 
more spread out with a large separation between the two main clusters. There is a 
very strongly grouped central cluster in Figure 2c, which contains glass from class 
28. Class 10 is at the top of the rightmost cluster and class 4 at the right part of the 
other cluster. The first eigenvector pair in Figure 2b shows two distinct clusters. In the 
center between these two clusters we can see classes 28 and 70 together, 18 on its own 
again and 70 and 52 are almost distinct from the left cluster. We can see that there is 
some structure in the clusters which hints at further sub-clusters. Unlike in Figure 2a, 
class 10 in the bottom left of Figure 2b is not completely distinct from class 23, 
which is spread throughout the cluster from the top to the bottom. 

ML factor pair 1-3 (Figure 2c) is more defined than any other eigenvector/factor 
pair, showing much greater definition of sub-clusters: separation of class 18, 70 as a 
cluster, 28 as a cluster, and 47 in the center of the projection, distinct if spread out 
along the y-axis. In the rightmost cluster we can see very distinct structure which 
suggests 9 sub-clusters; upon investigation these give quite distinct and sensible 
groupings. Table 1 shows some of the classes in each of these sub-clusters. 







Fig. 2. A comparison of the ML factor 1-2, 1-3 pair and PCA eigenvector 1-2, 1-3 pair: 
(a) ML factor pair 1-2; (b) PCA eigenvector pair 1-2; (c) ML factor pair 1-3; 
(d) PCA eigenvector pair 1-3. 



Table 1. Classes belonging to 6 of the sub-clusters found in the rightmost cluster of Figure 2c. 

Cluster   1        2                3             4            5         6
Classes   19, 24   11, 46, 49, 70   3, 5, 7, 60   22, 44, 64   51, 51t   53, 59



PCA eigenvector pair 1-3 (Figure 2d) suggests some sub-clustering in the right 
most cluster but upon investigation these sub-clusters are not well defined, containing 
an uninterpretable mix of classes. 



Results of HCML on Spectroscopic Data 

Figure 3 shows the result of HCML on the rightmost cluster of Figure 2c. The outputs 
of the 4 ML factors were used as the inputs to the second layer of the HCML network, 
which resulted in greater separation between the clusters. 








Fig. 3. Result of HCML on right cluster in Figure 2c. 

In Figure 3, we can see that the HCML shows new clusters in the hierarchical 
projection of the original factors from Figure 2c. The projection identifies separate 
clusters and a central mass which after a separate projection gives more clusters still. 
The HCML method has the advantage that the selection of regions of interest can 
occur repeatedly until the structure in the data has been fully identified. 

We have developed a new method for the visualization of data which can be used 
as a complementary technique to PCA. 

We have shown that the combination of the scatter plot matrices with HCML 
provides a method for selecting a combination of factor pairs which identify strong 
structure in the data. By then selecting these points, projecting them through the HCML 
network and re-projecting in a further layer, the network can identify further structure 
in the data. 


Abstract. Feature selection is a crucial activity when knowledge discovery is 
applied to very large databases, as it reduces dimensionality and therefore the 
complexity of the problem. Its main objective is to eliminate attributes to obtain 
a computationally tractable problem, without affecting the quality of the 
solution. To perform feature selection, several methods have been proposed, 
some of them tested over small academic datasets. In this paper we evaluate 
different feature selection-ranking methods over a very large real world 
database related to a Mexican electric energy client-invoice system. Most of 
the research on feature selection methods only evaluates accuracy and 
processing time; here we also report on the amount of discovered knowledge 
and stress the issue around the boundary that separates relevant and irrelevant 
features. The evaluation was done using the Elvira and Weka tools, which integrate 
and implement state of the art data mining algorithms. Finally, we propose a 
promising feature selection heuristic based on the experiments performed. 



1 Introduction 

Data mining is mainly applied to large amounts of stored data to look for the implicit 
knowledge hidden within this information. In other words, it looks for tendencies or 
patterns of behavior that allow organizations to improve their current procedures for 
marketing research, production, operation, maintenance, invoicing and others. To take 
advantage of the enormous amount of information currently available in many databases, 
algorithms and tools specialized in the automatic discovery of hidden knowledge 
within this information have been developed; this process of non-trivial extraction of 
relevant information that is implicit in the data is known as Knowledge Discovery in 
Databases (KDD), in which the data mining phase plays a central role [1]. 

It has been noted, however, that when very large databases are mined, 
the mining algorithms become very slow, requiring too much time to process the 
information and sometimes making the problem intractable. One way to attack this 
problem is to reduce the amount of data before applying the mining process [2]. In 
particular, the pre-processing method of feature selection applied to the data before 
mining has been shown to be successful, because it eliminates the irrelevant or redundant 
attributes that cause the mining tools to become inefficient, while preserving the 
classification quality of the mining algorithm. Sometimes the percentage of instances 
correctly classified gets even higher when using feature selection, because the data to 
mine are free of noise or of data that cause the mining tool to generate overfitted 
models [3]. 

In general, wrapper and filter methods have been applied to feature selection. 
Wrapper methods, although effective in eliminating irrelevant and redundant attributes, 
are very slow because they apply the mining algorithm many times, changing the 
number of attributes at each execution, as they follow some search and stop criteria 
[4]. Filter methods are more efficient; they use existing techniques such as decision 
tree algorithms, neural networks, nearest neighbors, etc., that take into account 
dependencies between attributes. Another technique, called the ranking method, uses 
some type of information gain measurement between individual attributes and the 
class, and it is very efficient [5]; however, because it measures the relevance of each 
attribute in isolation, it cannot detect whether redundant attributes exist, or whether a 
combination of two attributes, apparently irrelevant when analyzed independently, can 
become relevant [6]. 
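
The ranking idea, and its blind spot for attribute combinations, can be illustrated with a small sketch that ranks discrete attributes by information gain with respect to the class. The function names and the toy data are assumptions made for this example; they are not the measures or datasets evaluated in this paper.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels):
    """Reduction in class entropy obtained by knowing one attribute."""
    n = len(labels)
    groups = {}
    for v, label in zip(values, labels):
        groups.setdefault(v, []).append(label)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional

def rank_attributes(rows, labels):
    """Return (attribute index, gain) pairs sorted by decreasing gain."""
    gains = [(j, information_gain([row[j] for row in rows], labels))
             for j in range(len(rows[0]))]
    return sorted(gains, key=lambda item: item[1], reverse=True)

# Attribute 0 determines the class; attribute 1 is irrelevant.
ranking = rank_attributes([(0, "a"), (0, "b"), (1, "a"), (1, "b")], [0, 0, 1, 1])
# XOR-like data: each attribute alone looks irrelevant, together they decide the class.
xor_ranking = rank_attributes([(0, 0), (0, 1), (1, 0), (1, 1)], [0, 1, 1, 0])
print(ranking, xor_ranking)
```

In the second dataset both attributes receive a gain of zero even though the pair determines the class exactly, which is the limitation of individual ranking noted above.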

On the other hand, CFE (Federal Commission of Electricity in Mexico) faces the 
problem of accurately detecting customers that use energy illicitly, and consequently 
of reducing the losses due to this practice. At present, a lot of historical information 
is stored in the Commercial System (SICOM), whose database was developed and 
is maintained by CFE. SICOM was created mainly to register the users' contract 
information and the invoicing and collection data; this database has been in operation 
for several years and has accumulated a great amount of data (millions of records). 

To make the mining of this large database feasible, in an effective and efficient 
way, in this paper we present an evaluation of different filter-ranking methods for 
supervised learning. The evaluation takes into account not only the classification 
quality and the processing time obtained after applying each ranking method as a 
filter, but also the size of the discovered knowledge: the smaller it is, the easier 
it is to interpret. 

We also address the question of where to place the boundary between relevant and 
irrelevant attributes; since the ranking methods by themselves do not give this 
information, the decision is otherwise left without a criterion. We propose an 
extension, simple to apply, that unifies the criterion for the attribute 
boundary across the different ranking methods evaluated. 

Finally, based on the experimentation results, we propose a heuristic that looks for 
the efficient combination of ranking methods with the effectiveness of the wrapper 
methods. Although our work focuses on the SICOM data, the lessons learned can be 
applied to other real world databases with similar problems. 



2 Related Work 

The emergence of Very Large Databases (VLDB) leads to new challenges that the 
mining algorithms of the 1990s are incapable of attacking efficiently. This is why new 
specialized mining algorithms for VLDB are required. According to [7], from the 
point of view of the mining algorithms, the main lines to deal with VLDB (scaling up 
algorithms) are: a) to design fast algorithms, optimizing searches, reducing 
complexity, finding approximate solutions, or using parallelism; b) to divide the data 
based on the variables involved or the number of examples; and c) to use relational 
representations instead of a single table. 

In particular, these new approaches in turn give rise to Data Reduction or 
Dimensional Reduction. Data Reduction tries to eliminate variables, attributes or 
instances that contribute little or no information to the KDD process, or to group 
the values that a variable can take (discretizing). These methods are generally 
applied before the actual mining is performed. Although in the 1990s pre-processing 
was minimal, and almost all the discovery work was left to the mining algorithm, 
every day we see more and more Data Pre-processing activities. This pre-processing 
allows the mining algorithm to do its work more efficiently (faster) and effectively 
(better quality). 

In fact, the specialized literature mentions the curse of dimensionality, referring to 
the fact that the processing time of many induction methods grows dramatically 
(sometimes exponentially) with the number of attributes. Searching for improvements 
on VLDB processing power (necessary with tens of attributes and hundreds of 
thousands of instances), two main groups of methods have appeared: wrappers and 
filters. 

The basic approach of wrapper methods is to use the same induction algorithm to 
select the relevant variables and then to execute the classification task or mining 
process. The mining algorithm is executed once for every attribute subset that is 
evaluated. With a 100-attribute dataset the total number of possible states 
and runs would reach $1.26 \times 10^{30}$, which tells us that an exhaustive method is 
out of consideration, except for databases with very few attributes. 
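
The size of this search space follows from counting subsets: each of n attributes is either included or excluded, giving 2^n candidate subsets. The short sketch below, whose function name is chosen here for illustration, reproduces the order of magnitude quoted for 100 attributes (the exact value is about 1.27 × 10^30).

```python
def subset_count(n_attributes):
    """Number of distinct attribute subsets, including the empty set."""
    return 2 ** n_attributes

for n in (10, 20, 100):
    # Scientific notation makes the exponential growth easy to compare.
    print(n, f"{subset_count(n):.2e}")
```

Even at a million evaluations per second, enumerating all subsets of 100 attributes would take far longer than any practical time budget, which is why wrappers rely on heuristic search.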

On the other hand, filter methods use algorithms that are independent of the mining 
algorithm and are executed prior to the mining step. Among filter methods 
are the algorithms for relevant variable selection, generally called feature selection, 
and the instance sampling methods, also known as sub-sampling algorithms [8]. 

A great variety of filter methods exist for feature selection. Some authors consider 
the ID3 algorithm [9] (and its extensions) as one of the first proposed approaches to 
filtering, even though ID3 is more often used as a mining algorithm. Among the 
pioneering and much cited filter methods are FOCUS [10], which makes an exhaustive 
search of all the possible attribute subgroups and is therefore only appropriate for 
problems with few attributes, and RELIEF [11], which has the disadvantage of not 
being able to detect redundant attributes. 

Koller [12] uses a distance metric called cross-entropy or KL-distance, which 
compares two probability distributions and indicates the error, or distance, between 
them. This method obtains around a 50% reduction in the number of attributes while 
maintaining the quality of classification and significantly reducing processing times 
(for example, from 15 hours for a wrapper scheme to 15 minutes for the 
proposed algorithm). The final result is “sub-optimal” because it assumes 
independence between attributes, which is not always true. Piramuthu [6] evaluates 
10 different measures of the attribute-class distance, using Sequential Forward 
Search (SFS) to include the best attributes selected by each measure in a subset, 
such that the final result is a “better” attribute subset than the individual groups 
proposed by each measure. However, the results are not compared with the original 
attribute set, so it is not possible to conclude anything about the effectiveness of 
each measure. Moreover, although SFS manages to reduce the search space, multiple 
runs of the mining algorithm with varying attribute subsets are necessary to validate 
the scheme, which is computationally expensive. 
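
To make the KL-distance concrete, the following sketch computes it for two discrete distributions. It illustrates only the distance itself, not the selection algorithm of [12]; the function name and the example distributions are hypothetical.

```python
import math


def kl_distance(p, q):
    """Kullback-Leibler distance D(p || q) in bits between two discrete distributions.

    p and q are sequences of probabilities over the same outcomes; q must be
    non-zero wherever p is non-zero.
    """
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical class distributions estimated with and without one attribute.
with_attribute = [0.5, 0.5]
without_attribute = [0.25, 0.75]
print(kl_distance(with_attribute, without_attribute))
```

A distance close to zero would suggest that removing the attribute barely changes the class distribution, so the attribute is a candidate for elimination.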

SOAP is a method that operates only on numerical attributes [13] and has a low 
computational cost; it counts the number of times the class value changes with respect 
to an attribute whose values have been sorted in ascending order. SOAP reduces the 
number of attributes as compared to other methods; nevertheless, it does not handle 
discrete attributes, and the user has to supply the number of attributes that will be 
used in the final subset. Another filter scheme, SAPPP [5], handles continuous and 
discrete attributes. SAPPP initially selects an attribute subset and, each time it 
increases the number of attributes, uses a decision tree construction algorithm to 
evaluate whether the added attributes make the subset more relevant than the previous 
tree did. It checks whether they affect the classification quality (accuracy); if they 
do not, they are discarded as irrelevant and the process stops. A 30% 
reduction in processing time was obtained while maintaining the classification 
accuracy. Nevertheless, work remains to determine how many instances to use at the 
beginning and how to select the increment for each step. 
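
The counting step attributed to SOAP above can be sketched as follows. This is a minimal illustration of the idea as described in the text, not the published implementation of [13]; the function name and the toy data are hypothetical.

```python
def class_changes(values, labels):
    """Count class-label changes after sorting the instances by one numeric attribute."""
    ordered = [label for _, label in sorted(zip(values, labels), key=lambda pair: pair[0])]
    return sum(1 for a, b in zip(ordered, ordered[1:]) if a != b)


# Toy data: the first attribute separates the classes cleanly, the second does not.
labels = ["licit", "licit", "illicit", "illicit"]
print(class_changes([1.0, 2.0, 3.0, 4.0], labels))
print(class_changes([1.0, 3.0, 2.0, 4.0], labels))
```

Fewer label changes indicate that the sorted attribute groups instances of the same class together, so attributes can be ranked by ascending change count.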

Molina [14] tried to characterize 10 different methods to select attributes by 
measuring the impact of redundant and irrelevant attributes, as well as of the number 
of instances. Significant differences could not be obtained, and it was observed that, 
in general, the results of the different methods depended on the data being used. 
Stoppiglia [15] proposes adding a random variable to the database, 
such that, after the attributes are ranked, all variables that obtain lower scores 
than the random variable are considered irrelevant. This criterion represents an 
alternative to Fisher's statistical test. The results show that the method obtains a 
good attribute selection, comparable to that of other techniques. The method is 
attractive because of its simplicity, although more experiments are needed to prove 
its effectiveness; for example, it could be that most of the time the random variable 
manages to eliminate very few attributes (or none), so that the 
power to select attributes would be reduced. In section 4.3, we explore this and 
other subjects. 

Other proposals for feature selection explore the use of neural networks, fuzzy 
logic, genetic algorithms, and support vector machines [3], but they are 
computationally expensive. In general, the methods that have been 
proposed: a) are verified with small, academic or simulated databases; b) obtain 
results that vary with the application domain; c) obtain higher-quality 
results at a greater computational cost; d) depend on suitable tuning; and e) 
do not evaluate the size of the extracted knowledge, which is a key factor in 
understanding the phenomenon underlying the data. 



3 The Application Domain 

One of the main functions of CFE is to distribute to customers the electrical energy 
produced in the different generating plants in Mexico. In distribution, CFE 
faces different problems that prevent it from recovering a certain amount of “lost 
income” from the total energy for sale. At present CFE loses approximately 21% 
of the energy for distribution. These losses are mainly due to two kinds of problems: 
a) technical, and b) administrative. The technical energy losses are usually in the 
range of 10%, and a great investment in new technologies for the 
distribution equipment would be needed to reduce this percentage. The other 11% of the 
losses are due to administrative control problems, which are classified into three 
categories of anomalies: a) invoicing errors, b) measurement errors, and c) illicit 
energy use or fraud. The first two have a minimal percentage impact, so the main 
problem is the illicit use of energy, that is, people who steal energy and 
therefore do not pay for it. 

CFE has faced this problem with different actions (such as increasing the frequency 
of measurement equipment readings for suspect customers, or installing equipment for 
automatic readings) and has managed to reduce the percentage of losses due to illicit 
use, which represents a recovery of several million dollars. Since the problem has 
not been completely solved, it is important to attack it with other technologies and 
actions, using a knowledge discovery approach based on data mining to obtain 
behavior patterns of illicit customers. This alternative solution does not require 
a great deal of investment and has proven to be effective in similar cases, such as 
credit card fraud detection [16]. 

The information to be analyzed is a sample of the SICOM database, a legacy 
system developed in COBOL. It contains around twenty tables with 
information about contracts, invoicing, and collection from customers across the 
nation. This system was not designed with the discovery of illicit users in mind; 
nevertheless, it contains a field called debit-type in which a record is made if the 
debit is due to illicit use of energy. After joining three tables, including the one 
that has the debit-type field, a “mine” with 35,983 instances was obtained with the 
following attributes: Permanent customer registry (RPU), Year, Month, debit-type, 
Digit, kWh, Energy, Cve-invoicing, Total, Status, Turn, Tariff, Name, Installed-load, 
Contract-load, and others that altogether add up to 21 attributes. One of the values 
that the attribute debit-type can take is “9”, which indicates illicit use, and this is 
our class attribute. Various experiments were executed with this database to evaluate 
the different ranking methods, as described next. 



4 Evaluating Ranking Methods 

4.1 Measuring the Attributes Degree of Relevance 

The application of filter-ranking methods to select features of a VLDB is adequate 
due to their low computational cost. We use the Elvira [17] and Weka [18] tools, since 
they provide suitable and up-to-date platforms for the easy execution of multiple 
experiments in a PC environment. In the presentation of the experiments the processing 
time has been left out because it was always very small; for example, Elvira obtains, 
in less than a second, the Mutual Information distance to measure the relevance of 21 
attributes using 35,983 instances. The result is shown in the left column of Table 1. 

Although in this case the attributes appear ordered according to their relevance, we 
lack a uniform criterion to decide which attributes to select. We used the Stoppiglia 
criterion [15], modified as follows: instead of using a single random variable, 
we added three, to observe whether the ranking method keeps the 
random variables together in the set of ranked attributes. This avoids a possible bias 
introduced into the result by a single random variable, which is in fact a 
computational pseudo-random variable. The result is shown in the right column of 
Table 1, where the variables RAND3, RAND2 and RAND1 are the boundaries of the four 
subsets of attributes. 

Table 1. Ranking using Elvira (Mutual Information distance) 

Traditional ranking                 Ranking with three random variables 
fctura   0.09097299304149882        fctura   0.09097299304149882 
status   0.06121332572180206        status   0.06121332572180206 
kwEen    0.051186334426340505       kwEen    0.051186334426340505 
cCEto    0.045967636246832214       cCEto    0.045967636246832214 
kwMen    0.0443061751909163         RAND3    0.04450328124055651 
toMkw    0.04376718990743937        kwMen    0.0443061751909163 
enrgia   0.04325196857770465        toMkw    0.04376718990743937 
kwMcI    0.04308595013830481        enrgia   0.04325196857770465 
toMcI    0.04302669641028058        kwMcI    0.04308595013830481 
kwh      0.04259503495345594        toMcI    0.04302669641028058 
total    0.042438776707532586       RAND2    0.04295118668801855 
mes      0.04204718796227498        kwh      0.04259503495345594 
toMcC    0.04163309856095569        total    0.042438776707532586 
cIEen    0.038549970847028533       mes      0.04204718796227498 
toMen    0.03831938680147813        toMcC    0.04163309856095569 
cginst   0.036173176514204305       RAND1    0.04031876955965204 
cgCont   0.034291607355202744       cIEen    0.038549970847028533 
cIMcC    0.02679884377613058        toMen    0.03831938680147813 
anio     0.004512035977610684       cginst   0.036173176514204305 
tarifa   0.0010537446951081608      cgCont   0.034291607355202744 
digito   7.321404042019974E-4       cIMcC    0.02679884377613058 
                                    anio     0.004512035977610684 
                                    tarifa   0.0010537446951081608 
                                    digito   7.321404042019974E-4 
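
The procedure can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the Elvira implementation: attributes are assumed to be already discretized, the probes are drawn as uniform integers from 0 to 3 (an arbitrary choice), and the names mutual_information, rank_with_probes and split_at_probes are hypothetical.

```python
import math
import random
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information (in bits) between two discrete sequences of equal length."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(c / n * math.log2(c * n / (px[x] * py[y])) for (x, y), c in pxy.items())


def rank_with_probes(columns, target, n_probes=3, seed=0):
    """Rank attributes by relevance after adding n_probes pseudo-random columns."""
    rng = random.Random(seed)
    scored = dict(columns)
    for k in range(1, n_probes + 1):
        # Assumption: each probe is a uniform discrete variable with four values.
        scored[f"RAND{k}"] = [rng.randint(0, 3) for _ in target]
    return sorted(scored, key=lambda name: mutual_information(scored[name], target),
                  reverse=True)


def split_at_probes(ranking):
    """Cut a ranking into the attribute subsets delimited by the RAND* probes."""
    subsets, current = [], []
    for name in ranking:
        if name.startswith("RAND"):
            subsets.append(current)
            current = []
        else:
            current.append(name)
    subsets.append(current)
    return subsets
```

Applied to the right-hand ranking of Table 1, split_at_probes yields four subsets of 4, 5, 4 and 8 attributes; the first subset (fctura, status, kwEen, cCEto) is the one retained under the Stoppiglia-style criterion.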

Following the same procedure, we applied different ranking methods to the 
database (a detailed explanation of the “distances” used can be found in [6] and [14]); 
the results are shown in Table 2. The methods Principal Component Analysis 
(PCA), Information Gain, Gain Ratio and Symmetrical were also explored; they 
produced results similar to those of Chi-Square, which means that they did not obtain 
a significant reduction in the number of attributes. From Table 2 we observe that, 
although some ranking methods agree on the selection of some attributes, in general 
each method produces a different attribute ordering, including the positions of the 
three random variables. (This is a very interesting result, as we will see in Table 3.) 

Table 2. Application of different ranking measures 

Euclidean  Matusita   Kullback-  Kullback-  Shannon    Bhatta-    Relief     OneR       Chi- 
distance   distance   Leibler 1  Leibler 2  entropy    charyya                          Square 
fctura     fctura     fctura     fctura     kwh        kwEen      anio       factra     factra 
mes        kwEen      status     mes        enrgia     fctura     mes        status     status 
cIMcC      kwMen      kwEen      status     total      kwMen      factra     anio       mes 
anio       RAND3      cCEto      cginst     tarifa     RAND3      digito     tarifa     kwEen 
RAND3      status     RAND3      cgCont     cginst     toMkw      RAND3      digito     kwMcI 
tarifa     cCEto      kwMen      cCEto      cgCont     toMcI      RAND1/2    mes        kwh 
digito     toMkw      toMkw      cIMcC      kwEen      enrgia     RAND1/2    cIMcC      toMcI 
status     enrgia     enrgia     anio       toMcI      cCEto      status     cgCont     toMcC 
RAND2      toMcI      kwMcI      kwEen      kwMen      total      cginst     cginst     total 
cIEen      total      toMcI      RAND2      toMen      RAND1/2    tarifa     RAND1/2    toMen 
cginst     kwMcI      RAND2      RAND3      toMkw      kwMcI      cgCont     toMkw      enrgia 
cgCont     RAND1/2    kwh        toMkw      kwMcI      toMcC      cCEto      RAND1/2    kwMen 
RAND1      kwh        total      kwh        toMcC      RAND1/2    cIEen      cCEto      toMkw 
cCEto      toMcC      mes        kwMen      cCEto      kwh        cIMcC      kwh        cCEto 
toMcC      RAND1/2    toMcC      cIEen      cIEen      toMen      kwEen      toMcI      cIEen 
kwMcI      toMen      RAND1      enrgia     status     mes        toMen      RAND3      cginst 
toMkw      mes        cIEen      total      cIMcC      status     total      total      cgCont 
toMen      cIEen      toMen      kwMcI      anio       cIEen      toMcC      toMen      anio 
kwMen      cginst     cginst     RAND1      RAND1/2    cginst     toMcI      kwEen      cIMcC 
toMcI      cgCont     cgCont     tarifa     RAND1/2    cgCont     kwh        enrgia     tarifa 
kwEen      cIMcC      cIMcC      digito     RAND3      cIMcC      kwMcI      kwMen      RAND1/2 
total      anio       anio       toMcC      digito     anio       kwMen      cIEen      RAND3 
enrgia     tarifa     tarifa     toMcI      mes        tarifa     toMkw      kwMcI      RAND1/2 
kwh        digito     digito     toMen      fctura     digito     enrgia     toMcC      digito 

RAND1/2 denotes an entry that is either RAND1 or RAND2. 



4.2 Performance Evaluation of the Methods 

In order to evaluate the methods, we applied the J4.8 tree induction classifier (the 
Weka implementation of the last public version of C4.5) to the database “projected” 
onto the attributes selected by each method. Table 3 shows the results. In all cases 
we used Weka's default parameters and the attributes of the first subset, 
identified by the appearance of the first random variable (section 4.3 analyzes 
this in more detail). The feature reduction column compares the number of attributes 
selected against the total number of attributes (0 means that all attributes are kept). 
The processing time is expressed in relation to the time required to obtain a tree that 
includes all the attributes of the database (complete case). The size of the discovered 
knowledge is measured by the number of leaves and the number of nodes of the induced 
tree. The classification quality appears as the percentage of instances correctly 
classified using the training data (accuracy) and also using a 10-fold cross-validation 
test. A column is included for the cost-benefit that would be obtained if the 
discovered knowledge were applied by the organization, assuming that each inspection 
has a cost of -2.5 units and that the benefit of a correct prediction of an illicit is 
+97.5 units. The reported cost-benefit corresponds to the above-mentioned 10-fold 
cross-validation test and is calculated considering that the complete case obtains 
1000 units of benefit; the results of the other methods are normalized with respect 
to the complete case. 
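
The normalization can be illustrated with a small sketch. Only the two payoffs stated in the text (-2.5 and +97.5) are taken from the source; the payoffs for missed illicit customers and for correctly ignored customers are not given, so they are set to zero here purely as placeholder assumptions, and the sketch is not expected to reproduce the figures of Tables 3 and 4. All names and counts are hypothetical.

```python
def net_benefit(tp, fp, fn=0, tn=0, hit=97.5, inspection=-2.5, miss=0.0, idle=0.0):
    """Total payoff of a classifier from its confusion-matrix counts.

    hit:        payoff of a correctly predicted illicit (from the text)
    inspection: payoff of inspecting a customer who turns out to be licit (from the text)
    miss, idle: payoffs for false negatives and true negatives (ASSUMED to be 0)
    """
    return hit * tp + inspection * fp + miss * fn + idle * tn


def normalized_benefit(method_counts, complete_counts, scale=1000):
    """Express a method's payoff relative to the complete case, which is fixed at `scale`."""
    return scale * net_benefit(*method_counts) / net_benefit(*complete_counts)


# Hypothetical (tp, fp) counts for the complete case and for a reduced attribute set.
print(normalized_benefit((5, 2), (10, 4)))
```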

In Table 3, we observe that most of the methods obtain a reduction of the number 
of attributes greater than 0.50 and reduce the mining algorithm processing time by an 
order of magnitude. A special case is Relief: unlike the other methods, whose 
processing time is small, Relief requires a proportion of time of 9721, against 100 for 
inducing the tree using all the attributes. With respect to the size 
of the discovered knowledge, almost all the methods produce trees 
smaller than the complete case. On the other hand, although apparently none of the 
methods affects the accuracy of the discovered knowledge very much, the 
cost-benefit column highlights those methods that have a better impact on the 
prediction of the illicit energy use patterns. 

Table 3. Evaluating ranking methods by inducing J4.8 trees 

Method               Feature     Time        Leaves /   Acc train /    Cost-benefit 
                     reduction               Nodes      test           (test) 
Complete case        0           100         21/41      98.41/97.25    1000 
Mutual Information   0.80        12          5/9        90.86/90.10    444 
Euclidean distance   0.80        11          3/5        93.89/93.89    520 
Matusita distance    0.86        8           2/3        90.58/90.21    507 
Kullback-Leibler 1   0.80        11          5/9        90.86/90.10    444 
Kullback-Leibler 2   0.57        14          17/33      98.26/97.50    1001 
Shannon entropy      0.14        92          23/45      95.52/93.71    876 
Bhattacharyya        0.86        9           2/3        90.18/90.21    507 
Relief               0.80        12 + 9721   3/5        93.89/93.89    520 
OneR                 0.57        15          12/23      96.64/95.95    892 


Table 4. Using feature subsets to induce J4.8 trees 

Feature subsets           Feature     Time   Leaves /   Acc train /    Cost-benefit 
                          reduction          Nodes      test           (test) 
begin-RAND2               0.57        14     17/33      98.26/97.50    1001 
RAND3-RAND1               0.66        12     1/1        79.42/79.45    -910 
RAND1-end                 0.76        11     1/1        79.42/79.42    -913 
begin-RAND1               0.23        16     17/33      98.26/97.43    1001 
RAND3-end                 0.42        18     1/1        79.42/79.45    -910 
begin-RAND2 / RAND1-end   0.33        17     21/41      98.41/97.18    992 



4.3 Combination of Ranking and Wrapper Methods 

Although the ranking methods are very efficient, they have a flaw in that they do not 
take into account the possible interdependences between attributes. Based on the 
results mentioned above, we propose a heuristic that seeks to overcome 
this deficiency by combining the efficiency of the ranking methods with the 
effectiveness of the wrapper methods. The heuristic involves the induction of a 
number of decision trees considering all subsets of attributes that a method produces 
(the subsets are delimited by the three random variables in Table 2). In this way 
we can observe, in a computationally economical way, whether some 
combination of attributes in the subsets improves the results 
obtained when using only the first attribute subset. For example, the application of 
KL-2 with three random variables produces three subsets. The trees 
induced by J4.8 using the three subsets and combinations of these subsets are 
shown in Table 4. For this case, no combination 
significantly improves the results of the first subset, so we can conclude 
that we have found a good solution, one that manages to reduce the processing time 
and the knowledge size without affecting the prediction quality of the tree. 
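
The enumeration of candidate attribute sets in this heuristic can be sketched as below; the tree induction itself (J4.8 in Weka) is left out. The KL-2 ranking is transcribed from Table 2, with the positions of RAND2 and RAND1 inferred from the feature reductions reported in Table 4; the function names are hypothetical.

```python
from itertools import combinations

N_ATTRIBUTES = 21
KL2_RANKING = ["fctura", "mes", "status", "cginst", "cgCont", "cCEto", "cIMcC", "anio",
               "kwEen", "RAND2", "RAND3", "toMkw", "kwh", "kwMen", "cIEen", "enrgia",
               "total", "kwMcI", "RAND1", "tarifa", "digito", "toMcC", "toMcI", "toMen"]


def probe_subsets(ranking):
    """Return the non-empty attribute groups delimited by RAND* probes, in ranking order."""
    groups, current = [], []
    for name in ranking + ["RAND_END"]:   # sentinel closes the last group
        if name.startswith("RAND"):
            if current:
                groups.append(current)
            current = []
        else:
            current.append(name)
    return groups


def candidate_sets(groups):
    """Every union of one or more groups: the attribute sets on which a tree is induced."""
    for size in range(1, len(groups) + 1):
        for combo in combinations(groups, size):
            yield [name for group in combo for name in group]


def feature_reduction(selected):
    """Fraction of the original attributes that is discarded."""
    return 1 - len(selected) / N_ATTRIBUTES


for attrs in candidate_sets(probe_subsets(KL2_RANKING)):
    print(len(attrs), f"{feature_reduction(attrs):.3f}")
```

With three groups there are only seven candidate sets (six proper combinations plus the complete case), compared with the roughly two million subsets an exhaustive wrapper would face for 21 attributes.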



5 Conclusions and Future Work 

The feature selection ranking methods are very efficient because they only need to 
calculate the relevance of each isolated attribute for predicting the class attribute. 
The disadvantages of these methods are that no uniform criterion is provided to decide 
which attributes are more relevant than others, and that no mechanism is included to 
detect the possible interdependences between attributes. In this article we propose 
adding three random variables to the database to avoid the possible bias 
introduced into the result if a single random variable is used. We observed that, 
although some ranking methods agree on the selection of some attributes, in general 
each method produces a different attribute ordering, including the positions of the 
three random variables. This is a very interesting result. The three variables serve 
as subset boundaries and help to decide which attributes to select. We also propose to 
analyze the possible interdependences between attributes using the trees induced 
on these subsets. These ideas have proven successful on a real-world 
electrical energy customer-invoice database. In the future these ideas will be 
applied to other databases and classifiers. In particular, we are going to perform more 
simulations that include multiple random variables to assess their usefulness as a 
criterion within the feature selection area. 



Abstract. When Case Based Reasoning systems are applied to real-world 
problems, the retrieved solutions in general require adaptations in 
order to be useful in new contexts. Therefore, case adaptation is a desir- 
able capability of Case Based Reasoning systems. However, case adapta- 
tion is still a challenge for this research area. In general, the acquisition of 
knowledge for adaptation is more complex than the acquisition of cases. 
This paper explores the use of a hybrid committee of Machine Learning 
techniques for automatic case adaptation. 



1 Introduction 

Case Based Reasoning (CBR) is a methodology for problem solving based on 
past experiences. This methodology tries to solve a new problem by retrieving 
and adapting previously known solutions of similar problems. However, retrieved 
solutions, in general, require adaptations in order to be applied to new contexts. 
One of the major challenges in CBR is the development of an efficient method- 
ology for case adaptation. In contrast to case acquisition, knowledge for case 
adaptation is not easily available and is hard to obtain [7,20]. 

The most widely used form of adaptation employs handcoded adaptation 
rules, which demands a significant effort of knowledge acquisition for case adap- 
tation, presenting a few difficulties [7]. Smyth and Keane [17], for example, pro- 
pose a case adaptation that is performed in two stages: first, it employs general 
adaptation specialists to transform a solution component target. Then, it uses 
general adaptation strategies to handle problems that can arise from the activi- 
ties of the specialists. These adaptation specialists and strategies are handcoded 
knowledge packages acquired specifically for a particular application domain. An 
alternative to overcome the difficulties in acquiring adaptation knowledge has 
been the use of automatic learning. This paper proposes a hybrid committee 
approach for case adaptation that automatically learns adaptation knowledge 
from a Case Base (CB) and applies it to adapt retrieved solutions. 

This paper is organized as follows: Section 2 briefly introduces the CBR 
paradigm. Section 3 presents some considerations about case adaptation. Section 4 
introduces the proposed approach for case adaptation. Section 5 shows the evaluation 
of the proposed system. Section 6 presents the final considerations. 



R. Monroy et al. (Eds.): MICAI 2004, LNAI 2972, pp. 302-311, 2004. 
© Springer-Verlag Berlin Heidelberg 2004 



2 Case Based Reasoning 

CBR is a methodology for problem solving based on past experiences. This 
methodology tries to solve a new problem by employing a process of retrieval 
and adaptation of previously known solutions of similar problems. CBR systems 
are usually described by a reasoning cycle (also named CBR cycle), which has 
four main phases [1]: 

1. Retrieval: according to a new problem provided by the user, the CBR system 
retrieves, from a CB, previous cases that are similar to the new problem; 

2. Reuse: the CBR system adapts a solution from a retrieved case to fit the 
requirements of the new problem. This phase is also named case adaptation; 

3. Revision: the CBR system revises the solution generated by the reuse phase; 

4. Retention: the CBR system may learn the new case by incorporating it into 
the CB, which is named case learning. The fourth phase can be divided 
into the following procedures: relevant information selection to create a new 
case, index composition for this case, and case incorporation into the CB. 

CBR is not a technology developed for specific purposes; it is a general 
methodology of reasoning and learning [1,19]. CBR allows unsupervised and 
incremental learning by updating the CB when a solution for a new problem is 
found [1]. 
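
The four phases of the cycle can be summarized in a short Python skeleton. It is a sketch under simple assumptions: cases are (description, solution) pairs with numeric descriptions, retrieval uses Euclidean distance, and the adapt and revise steps are supplied by the caller. None of these names come from the cited works.

```python
def distance(a, b):
    """Euclidean distance between two numeric problem descriptions."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


class CaseBase:
    def __init__(self, cases):
        self.cases = list(cases)              # each case: (description, solution)

    def retrieve(self, problem, k=1):
        """Return the k stored cases whose descriptions are closest to the problem."""
        return sorted(self.cases, key=lambda case: distance(case[0], problem))[:k]

    def retain(self, problem, solution):
        """Incorporate the solved problem into the case base as a new case."""
        self.cases.append((problem, solution))


def solve(case_base, problem, adapt, revise):
    retrieved = case_base.retrieve(problem)[0]    # 1. retrieval
    proposal = adapt(problem, retrieved)          # 2. reuse (case adaptation)
    solution = revise(problem, proposal)          # 3. revision
    case_base.retain(problem, solution)           # 4. retention (case learning)
    return solution
```

The adapt argument is where an adaptation mechanism such as the one proposed in this paper would be plugged in.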



3 Case Adaptation 



When CBR systems are applied to real-world problems, retrieved solutions rarely 
can be directly used as adequate solutions for each new problem. Retrieved solu- 
tions, in general, require adaptations (second phase of the CBR cycle) in order to 
be applied to new contexts. Several strategies for case adaptation have been pro- 
posed in the literature [9,16]. They can be classified into three main groups: sub- 
stitutional adaptation, transformational adaptation, and generative adaptation. 

Case adaptation is one of the major challenges for CBR [7,20] and there is still 
much to be done concerning it. Case adaptation knowledge is harder to acquire 
and demands a significant knowledge engineering effort. An alternative to over- 
come such difficulties has been the use of automatic learning, where case adapta- 
tion knowledge is extracted from previously obtained knowledge, the CB. For ex- 
ample, Wiratunga et al. [20] proposed an inductive method for automatic acqui- 
sition of adaptation knowledge from a CB. The adaptation knowledge extracted 
from the CB is used to train a committee of Rise algorithms [4] by applying 
Boosting [6] to generate different classifiers. However, the knowledge generation 
process proposed is specific to certain design domains due to the specific encoding 
employed for the adaptation training patterns and the extraction of differences 
between description attributes and between component solution attributes. 



4 Proposed Approach 

This work proposes the use of committees of Machine Learning (ML) algorithms 
to adapt cases retrieved from a CB (second phase of the CBR cycle). The commit- 
tees investigated are composed of ML algorithms, here named estimators, based 
on different paradigms. One ML algorithm, here named combiner, combines the 
outputs of the individual estimators to produce the output of the committee. 
The estimators and the combiner are used to perform adaptations in domains 
with symbolic components (in this work, the case solution attributes are named 
components), extending the approaches presented in [12,13]. The committee is 
composed of the following ML algorithms: 

- Estimators: a Multi Layer Perceptron (MLP) neural network [8]; the symbolic 
learning algorithm C4.5 [14]; and a Support Vector Machine (SVM) technique 
[18], based on statistical learning theory. 

- Combiner: in this work we investigated two ML algorithms as the combiner 
of the committee, an MLP neural network and the SVM technique. The 
combiner receives the outputs from the three estimators as input, 
combines the results, and produces the output of the committee. 
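
The estimator/combiner arrangement can be sketched in Python as below. The sketch shows only the committee structure; the simple learners used here (a majority-class predictor, a 1-nearest-neighbour rule and a vote table) are stand-ins chosen to keep the example self-contained and are not the MLP, C4.5 and SVM models used in this work. All class names are hypothetical.

```python
from collections import Counter


class Committee:
    """Estimators are trained first; the combiner is trained on their outputs."""

    def __init__(self, estimators, combiner):
        self.estimators, self.combiner = estimators, combiner

    def _stack(self, X):
        columns = [est.predict(X) for est in self.estimators]
        return [list(row) for row in zip(*columns)]   # one row of estimator outputs per input

    def fit(self, X, y):
        for est in self.estimators:
            est.fit(X, y)
        self.combiner.fit(self._stack(X), y)
        return self

    def predict(self, X):
        return self.combiner.predict(self._stack(X))


class MajorityClass:                                   # stand-in estimator
    def fit(self, X, y):
        self.label = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.label for _ in X]


class OneNearestNeighbour:                             # stand-in estimator
    def fit(self, X, y):
        self.X, self.y = list(X), list(y)
        return self

    def predict(self, X):
        def nearest(row):
            dists = [sum((a - b) ** 2 for a, b in zip(row, ref)) for ref in self.X]
            return self.y[dists.index(min(dists))]
        return [nearest(row) for row in X]


class VoteTable:                                       # stand-in combiner
    """Remembers the most frequent label for each tuple of estimator outputs."""

    def fit(self, Z, y):
        votes = {}
        for z, label in zip(Z, y):
            votes.setdefault(tuple(z), Counter())[label] += 1
        self.table = {z: c.most_common(1)[0][0] for z, c in votes.items()}
        self.default = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, Z):
        return [self.table.get(tuple(z), self.default) for z in Z]


X = [[0], [1], [2], [10], [11], [12]]
y = ["low", "low", "low", "high", "high", "high"]
committee = Committee([MajorityClass(), OneNearestNeighbour()], VoteTable()).fit(X, y)
print(committee.predict([[1.5], [10.5]]))
```

Any learner exposing fit and predict could be substituted for the stand-ins, including real MLP, decision tree and SVM implementations.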

MLP networks are the most commonly used Artificial Neural Network model 
for pattern recognition. An MLP network usually presents one or more hidden 
layers with nonlinear activation functions (generally sigmoidal) that carry out 
successive nonlinear transformations on the input patterns. In this way, the 
intermediate layers can transform nonlinearly separable problems into linearly 
separable ones [8]. 

C4.5 is a symbolic learning algorithm that generates decision trees [14]. It builds 
a decision tree from a training data set by applying a divide-and-conquer strategy 
and a greedy approach. It divides the training instances into 
subsets corresponding to the values of a candidate attribute and calculates the 
gain ratio of the attribute from these subsets. This process is repeated for all 
input attributes of the training patterns, the attribute with the best gain ratio 
is selected, and the procedure continues until a given subset contains instances 
of only one class. After construction, the model may be complex or too specific to 
the training data, so it needs to be pruned in order to improve 
its performance. Pruning is carried out by eliminating those nodes that do 
not affect the prediction [14]. 
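
The gain ratio used to compare attributes can be written out as a short sketch for discrete attributes. It follows the usual textbook definition (information gain divided by split information) and is not taken from the C4.5 source code; the function names and the toy data are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())


def gain_ratio(attribute_values, labels):
    """Information gain of a discrete attribute divided by its split information."""
    n = len(labels)
    partitions = {}
    for value, label in zip(attribute_values, labels):
        partitions.setdefault(value, []).append(label)
    remainder = sum(len(part) / n * entropy(part) for part in partitions.values())
    gain = entropy(labels) - remainder
    split_info = entropy(attribute_values)
    return gain / split_info if split_info > 0 else 0.0


labels = ["x", "x", "y", "y"]
print(gain_ratio(["a", "a", "b", "b"], labels))   # attribute that separates the classes
print(gain_ratio(["a", "b", "a", "b"], labels))   # attribute unrelated to the classes
```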

SVM is a family of learning algorithms based on statistical learning theory 
[18]. It combines generalization control with a technique that deals with the 
dimensionality problem¹ [18]. This technique basically uses hyperplanes as decision 
surfaces and maximizes the separation margins between positive and negative 
classes. In order to achieve these large margins, SVM follows a statistical principle 
named structural risk minimization [18]. Another central idea of SVM algorithms 
is the use of kernels to build support vectors from the training data set. 

¹ Machine Learning algorithms can perform poorly when working on data 
sets with a high number of attributes. Attribute selection techniques can reduce 
the dimensionality of the original data set. SVM is an ML algorithm capable of 
obtaining good generalization even for data sets with many attributes. 

The proposed approach for case adaptation employs two modules. The first 
module (adaptation pattern generation) produces a data set of adaptation pat- 
terns. This data set is then used by the second module (case adaptation mecha- 
nism) that trains a committee of ML algorithms to automatically perform case 
adaptation. This approach extends the approach proposed in [12,13] by exploring 
the use of a hybrid committee of ML algorithms as case adaptation mechanism. 
This approach assumes that a CB is representative [15], i.e. the CB is a good 
representative sample of the target problem space. Therefore, no re-training of 
the adaptation mechanism is required when the system creates new cases during 
the reasoning process. 



4.1 Adaptation Pattern Generation 

The data set generation module proposed is capable of extracting implicit knowl- 
edge from a CB. This module employs an algorithm that is similar to that pro- 
posed in [12,13] (see Algorithm 1). 



Algorithm 1 Adaptation Pattern Generation 

function AdaptationPatternGenerate(CasesNumber, Component) 
    for each case from the original case base do 
        ProofCase ← ProofCaseExtract() 
        ProofDescrpt ← DescriptionExtract(ProofCase) 
        ProofSolution ← SolutionExtract(ProofCase, Component) 
        RetrievedCases ← Retrieve(ProofDescrpt, CasesNumber) 
        for each RetrievedCases(i) do 
            RetDescrpt ← DescriptionExtract(RetrievedCases(i)) 
            RetSolution ← SolutionExtract(RetrievedCases(i), Component) 
            MakeAdaptationPattern(ProofDescrpt, RetDescrpt, RetSolution, ProofSolution) 
        end for 
    end for 
end function 



Initially, the pattern generation algorithm extracts a case from the original 
CB and uses it as a new problem (ProofCase) to be presented to the CBR 
system. The remaining cases compose a new CB without the proof case. Next, 
the algorithm extracts, from the proof case, the attributes of the problem 
(ProofDescrpt) and a component (indicated by Component) of the solution 
(ProofSolution). Then, the algorithm retrieves the CasesNumber cases most similar 
to ProofDescrpt (RetrievedCases), where CasesNumber is 
a predefined value. For each retrieved case, the attributes of the problem 
(RetDescrpt) and a component of the corresponding solution (indicated by 
Component) are extracted (RetSolution). Next, the algorithm generates the 
adaptation patterns using as input attributes: the problem description stored 
in the proof case, the problem description stored in the retrieved case, a 
component solution stored in the retrieved case; and as output attribute: a 
solution component stored in the proof case. Finally, the generated data sets are 
used to train the committee of ML algorithms. First, the MLP, the SVM, and 
C4.5 are trained individually using the adaptation pattern data set generated. 
Next, the outputs of these three ML algorithms are combined to produce a 
training data set for the combiner of the committee (MLP or SVM). 
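
The pattern generation procedure described above can be sketched in Python. The case representation is an assumption made for illustration: each case is a (description, solution) pair, where the description is a tuple of numbers and the solution is a dict from component names to values; retrieval uses Euclidean distance. Function names are hypothetical.

```python
def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def generate_adaptation_patterns(case_base, cases_number, component):
    """Leave-one-out generation of adaptation patterns, following Algorithm 1.

    Returns a list of (inputs, target) pairs, where inputs is the proof description,
    the retrieved description and the retrieved solution component, and target is
    the solution component of the proof case.
    """
    patterns = []
    for i, (proof_descr, proof_solution) in enumerate(case_base):
        remaining = case_base[:i] + case_base[i + 1:]      # CB without the proof case
        retrieved = sorted(remaining,
                           key=lambda case: euclidean(case[0], proof_descr))[:cases_number]
        for ret_descr, ret_solution in retrieved:
            inputs = (*proof_descr, *ret_descr, ret_solution[component])
            patterns.append((inputs, proof_solution[component]))
    return patterns


# Toy case base with one numeric description attribute and one solution component "c".
cases = [((0.0,), {"c": "A"}), ((1.0,), {"c": "A"}),
         ((5.0,), {"c": "B"}), ((6.0,), {"c": "B"})]
for pattern in generate_adaptation_patterns(cases, cases_number=2, component="c"):
    print(pattern)
```

Each of the four cases serves once as the proof case and is paired with its two nearest neighbours, giving eight adaptation patterns on which the estimators would then be trained.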



4.2 Case Adaptation Mechanism 

The proposed case adaptation mechanism allows the learning of the modifications 
that need to be performed on the component values of the retrieved 
solutions in order to achieve an adequate solution for a new problem. The most 
important characteristic of this mechanism is the use of implicit knowledge 
obtained from the CB with minimal knowledge acquisition effort. 
The case adaptation process is shown in Algorithm 2. 



Algorithm 2 Case Adaptation Mechanism

function Adaptation (Description, RetrievedCase, Component)
    RetDescription ← DescriptionExtract (RetrievedCase)
    RetSolution ← SolutionExtract (RetrievedCase, Component)
    InputPattern ← MakeInputPattern (Description, RetDescription, RetSolution)
    Acts ← AdaptationMechanism (Normalization (InputPattern), Component)
    NewSolution ← ApplyActs (RetSolution, Acts, Component)
    return NewSolution
end function



When a new problem is presented to the CBR system, the most similar
case stored in the CB is obtained by a retrieval mechanism [5,10]. This case
(RetrievedCase) is sent to the adaptation mechanism together with the problem
description (Description). The adaptation algorithm, in turn, extracts the
problem attributes from the retrieved case (RetDescription). Next, for each
component of the retrieved solution (indicated by Component), the algorithm
extracts the corresponding solution and generates an adequate input pattern for
the committee of ML algorithms developed for this component. Then, the committee
indicates the modifications to the component of the retrieved solution (Acts).
Finally, these modifications are applied to the current component in order to
obtain the solution for the new problem (NewSolution). The proposed adaptation
approach works with a single component of the solution of a case at a time. It
can easily be extended to domains where the solution of the cases has more than
one component, by treating each solution component as a distinct problem. This
strategy keeps the approach independent of the structure of the case solution.
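
A minimal Python sketch of the adaptation step follows. It assumes that a
trained committee is available as a plain callable per solution component and
that Acts is either a replacement value or None (keep the retrieved value),
which is one possible reading of substitutional adaptation. The names adapt,
adapt_all and apply_acts and the dictionary layout of a case are hypothetical.

```python
def apply_acts(ret_solution, acts):
    """Substitutional adaptation: None keeps the retrieved value, else replace it."""
    return ret_solution if acts is None else acts


def adapt(description, retrieved_case, component, committee, normalize=lambda p: p):
    """Adapt one solution component of the retrieved case to the new problem."""
    ret_description = retrieved_case["problem"]
    ret_solution = retrieved_case["solution"][component]
    input_pattern = (description, ret_description, ret_solution)
    acts = committee(normalize(input_pattern))
    return apply_acts(ret_solution, acts)


def adapt_all(description, retrieved_case, committees):
    """Treat every solution component as a distinct problem with its own committee."""
    return {
        component: adapt(description, retrieved_case, component, committee)
        for component, committee in committees.items()
    }
```

Because each component has its own committee, adapt_all does not depend on how
many components a case solution has, which mirrors the independence argument
made above.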

5 Empirical Evaluation 

This section presents a set of experiments carried out to explore the use of
committees of ML algorithms and to investigate whether they bring more precision
and stability to the system. To this end, the performances obtained with
committees of ML algorithms are compared to those obtained by using individual
ML algorithms for case adaptation: an MLP network, the C4.5 algorithm, and an
SVM. In order to show that automatic case adaptation may result in a
considerable gain in the prediction of the desired values for the solution
attributes, both case adaptation approaches, using committees of ML algorithms
and individual ML algorithms, have their performance compared with the
performance obtained by the individual ML algorithms when predicting the
solution attribute values directly. For the evaluation of the knowledge
extraction algorithm, a data set from the UCI Machine Learning repository [2]
was used. The selected domain was the Pittsburgh bridges data set. This data set
is originally composed of 108 cases. However, some cases contain missing values
for some solution attributes. After removing the cases with missing solution
attribute values, the data set was reduced to 89 cases. Missing input attribute
values were filled in with mean and median values. The case structure contains
5 discrete input attributes, 2 continuous input attributes, and 5 discrete
output attributes (see Table 1). In this domain, 5 output attributes (design
description) are predicted from 7 input attributes (specification properties).

Table 1. Pittsburgh bridges case structure.

            Attribute    Values
Problem     RIVER        a, m, o
            LOCATION     1..52
            ERECTED      1818..1986
            PURPOSE      walk, aqueduct, rr, highway
            LENGTH       804..4558
            LANES        1, 2, 4, 6
            CLEAR-G      N, G
Solution    T-OR-D       through, deck
            MATERIAL     wood, iron, steel
            SPAN         short, medium, long
            REL-L        S, S-F, F
            TYPE         wood, suspen, simple-t, arch, cantilev, cont-t

A sample of a case and of an adaptation pattern generated for the solution
component MATERIAL is shown in Figure 1. The adaptation pattern input is
composed of the attributes of the problem description from the proof case
and from the retrieved case. The adaptation pattern output is composed of the
value of the solution component from the proof case.

The MLP networks employed as estimators have 29 to 33 input units (depending
on the solution component), a hidden layer with 30 neurons, and 1 output
neuron. The MLP networks were trained using the backpropagation algorithm with
momentum, with a momentum term equal to 0.2 and a learning rate equal to 0.3.
The C4.5 algorithm was trained using default parameters. The SVM algorithm was
trained using the Radial Basis Function kernel and default parameters. The MLP
networks and the C4.5 algorithm were simulated using the WEKA library¹. The SVM
algorithm was simulated using the LIBSVM library². Three different adaptation
pattern data sets were created by generating adaptation patterns using 1, 3,
and 5 similar cases (see CasesNumber in Algorithm 1).

¹ Available at http://www.cs.waikato.ac.nz/ml/weka/index.htm
² Available at http://www.csie.ntu.edu.tw/~cjlin/libsvm

Fig. 1. Adaptation Pattern Sample.

The numerical values were normalized (see Normalization in Algorithm 2) to the
interval [0, 1]. For the MLP, SVM, and C4.5 techniques, the input symbolic
values were transformed into orthogonal vectors of binary values. For the MLP
and SVM, the symbolic solution components were also transformed in the same
way. Additionally, the original data set was balanced using an over-sampling
technique [3], because ML algorithms may fail to learn from unbalanced data
sets. However, further investigation is required to confirm whether this
balancing is necessary. The tests followed the 10-fold cross-validation
strategy. The patterns were randomly divided into 10 groups (folds) of similar
size. One fold was used as the test fold (a set of new problems to be presented
to the system) and the remaining 9 folds were used as the training fold (a set
of previously stored cases). Before an ML algorithm was trained, the
over-sampling technique was applied to the training fold. After the ML
algorithm was trained on the over-sampled training fold, the test fold was
presented to the system and the average absolute error was calculated. This
process was repeated for the remaining 9 folds. Next, the average and standard
deviation of the absolute error over the training sessions were calculated.
Table 2 shows the results of the tests carried out with the hybrid CBR systems,
with individual classifiers and with committees, using the three settings for
the parameter CasesNumber (see Algorithm 1), indicated in the column named K.
The results obtained by the individual techniques employed alone (MLP, C4.5,
and SVM) are also shown.
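
The evaluation protocol described above (over-sampling applied to the training
fold only, inside each cross-validation round) can be sketched as follows. This
is a hedged illustration: plain random over-sampling stands in for the
technique of [3], the per-fold error is taken to be the misclassification rate
in percent, and train_fn is a hypothetical hook for any of the learners.

```python
import random
from statistics import mean, stdev


def k_folds(n_items, k, rng):
    """Shuffle the indices 0..n_items-1 and deal them into k folds of similar size."""
    indices = list(range(n_items))
    rng.shuffle(indices)
    return [indices[i::k] for i in range(k)]


def oversample(rows, rng):
    """Random over-sampling: replicate rows of smaller classes up to the largest class."""
    by_class = {}
    for features, label in rows:
        by_class.setdefault(label, []).append((features, label))
    target = max(len(members) for members in by_class.values())
    balanced = []
    for members in by_class.values():
        balanced.extend(members)
        balanced.extend(rng.choice(members) for _ in range(target - len(members)))
    return balanced


def cross_validate(rows, train_fn, k=10, seed=0):
    """Return mean and standard deviation (in %) of the per-fold error rate.

    train_fn(training_rows) must return a function mapping features to a label.
    Assumes len(rows) >= k so that no fold is empty.
    """
    rng = random.Random(seed)
    fold_errors = []
    for test_indices in k_folds(len(rows), k, rng):
        held_out = set(test_indices)
        training = [row for i, row in enumerate(rows) if i not in held_out]
        predict = train_fn(oversample(training, rng))  # balance the training fold only
        wrong = sum(predict(rows[i][0]) != rows[i][1] for i in test_indices)
        fold_errors.append(100.0 * wrong / len(test_indices))
    return mean(fold_errors), stdev(fold_errors)
```

Balancing only the training fold keeps replicated examples out of the test
fold, so the reported error is measured on the original class distribution.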

In order to confirm the performance of the proposed approach, the authors
used a two-sided t test at the 99% confidence level [11]. The results
achieved are shown in Table 3.
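
For illustration, a two-sided t test over per-fold errors can be sketched as
below. The paper does not state whether the test was paired, so the paired form
over the 10 folds (9 degrees of freedom) is an assumption of this sketch, as are
the function names. The constant 3.250 is the standard two-sided critical value
of Student's t distribution for a significance level of 0.01 and 9 degrees of
freedom.

```python
from math import sqrt
from statistics import mean, stdev

T_CRIT_99_DF9 = 3.250  # two-sided critical value, alpha = 0.01, 9 degrees of freedom


def paired_t(errors_a, errors_b):
    """t statistic of the per-fold differences (requires non-constant differences)."""
    diffs = [a - b for a, b in zip(errors_a, errors_b)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))


def compare(errors_a, errors_b, t_crit=T_CRIT_99_DF9):
    """Label the comparison in the style of Table 3 (lower error is better)."""
    t = paired_t(errors_a, errors_b)
    if abs(t) < t_crit:
        return "similar performance"
    return "A better than B" if t < 0 else "B better than A"
```

The default critical value only applies to 10 paired observations; a different
number of folds requires the corresponding value from a t table.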

Although the results show that the use of committees does not outperform the
best CBR system with an individual ML algorithm as adaptation mechanism, they
also show that, in general, committees of classifiers reduce the standard
deviation of the results (that is, they introduce more stability to the
system). The results obtained by the committees could be better if more
estimators were employed. Additionally, the results show that the proposed
hybrid CBR approaches predicted the problem solutions better than the
classifier techniques used individually. This suggests that the adaptation
pattern data set extracted from the CB contains patterns that are consistent
(not conflicting) with each other. Therefore, this approach for adaptation
knowledge learning may be a useful technique for real-world problem solving.

Table 2. Average error rates and standard deviation for the proposed approach. The
two best results for each column are indicated with an (*). Comm. means Committee.

Average absolute error and standard deviation (%)

Model             K   t-or-d          rel-l            span              material         type              global
CBR (C4.5)        1   31.11 ± 9.02    65.24 ± 9.98     45.74 ± 7.98      32.30 ± 6.95     53.19 ± 8.02      45.52 ± 8.39
CBR (C4.5)        3   28.39 ± 7.07    54.27 ± 11.06    50.27 ± 8.05      33.03 ± 3.54     53.18 ± 9.98      43.83 ± 7.94
CBR (C4.5)        5   31.11 ± 8.34    55.17 ± 9.00     52.17 ± 7.21      35.03 ± 5.43     56.09 ± 11.50     45.91 ± 8.30
CBR (SVM)         1   16.00 ± 7.73    *31.33 ± 9.59    41.22 ± 9.25      24.22 ± 5.51     48.22 ± 13.70     32.20 ± 9.16
CBR (SVM)         3   6.00 ± 5.59     *31.33 ± 9.59    36.22 ± 10.58     13.22 ± 6.77     47.22 ± 12.21     *26.80 ± 8.95
CBR (SVM)         5   *6.00 ± 5.59    50.89 ± 15.56    *36.22 ± 10.58    *10.00 ± 5.33    *46.22 ± 13.41    29.87 ± 10.10
CBR (MLP)         1   23.84 ± 4.97    55.27 ± 12.76    45.74 ± 7.98      34.12 ± 6.69     56.99 ± 7.32      43.19 ± 7.94
CBR (MLP)         3   24.75 ± 4.77    56.08 ± 7.88     50.27 ± 8.05      29.39 ± 4.58     57.71 ± 8.03      43.64 ± 6.66
CBR (MLP)         5   24.75 ± 4.77    49.74 ± 10.57    52.17 ± 7.21      31.12 ± 4.78     55.80 ± 8.85      42.72 ± 7.23
CBR (Comm. MLP)   1   23.84 ± 3.60    55.27 ± 12.76    44.01 ± 8.75      34.12 ± 6.69     62.25 ± 12.37     43.90 ± 8.83
CBR (Comm. MLP)   3   25.66 ± 5.60    54.27 ± 11.06    50.27 ± 8.05      30.21 ± 5.75     53.08 ± 12.41     42.70 ± 8.57
CBR (Comm. MLP)   5   26.57 ± 7.89    49.74 ± 10.57    51.26 ± 5.54      31.12 ± 4.78     54.09 ± 10.88     42.56 ± 42.56
CBR (Comm. SVM)   1   *6.00 ± 5.59    43.67 ± 16.17    *35.33 ± 10.07    17.33 ± 8.87     *46.44 ± 6.56     29.75 ± 9.45
CBR (Comm. SVM)   3   7.00 ± 5.40     42.56 ± 10.25    36.33 ± 10.70     13.22 ± 6.77     51.33 ± 11.81     30.09 ± 8.99
CBR (Comm. SVM)   5   7.00 ± 5.40     33.33 ± 9.24     36.67 ± 11.31     *10.00 ± 5.33    48.22 ± 11.44     *27.04 ± 8.55
C4.5                  28.38 ± 9.22    61.61 ± 7.95     45.74 ± 7.98      31.39 ± 6.95     59.69 ± 7.54      45.36 ± 7.93
SVM                   19.00 ± 7.96    31.33 ± 10.30    39.22 ± 11.35     24.22 ± 5.51     57.44 ± 11.56     34.24 ± 9.33
MLP                   25.66 ± 6.56    50.63 ± 13.04    45.74 ± 7.98      31.30 ± 6.43     64.35 ± 10.32     43.54 ± 8.87

Table 3. Results for the t Test.

Compared Models                              Conclusion
CBR (Committee SVM - 5) and CBR (SVM - 3)    Similar Performance
CBR (SVM - 3) and MLP                        CBR (SVM - 3) better than MLP
CBR (SVM - 3) and SVM                        CBR (SVM - 3) better than SVM
CBR (SVM - 3) and M5                         CBR (SVM - 3) better than M5

6 Conclusions 

One of the major challenges in designing CBR systems is the acquisition and
modelling of the appropriate adaptation knowledge [7]. In this work, a CBR
system that uses a hybrid approach for case adaptation was proposed. This
approach uses a hybrid committee for case adaptation in a domain with symbolic
solution components, extending the proposal described in [12,13], where a
similar approach was tested using a single ML algorithm. The proposed approach
employs a process of adaptation pattern generation that can reduce the effort
for knowledge acquisition in domains that require substitutional adaptation (see
Section 3). Besides, the proposed hybrid approach is not computationally
expensive, since the generation of the adaptation patterns demands no
comparisons between solution components. Preliminary results show that the
committees do not outperform the best CBR system with an individual ML
adaptation mechanism. This suggests investigating committees with a larger
number of estimators, which would reduce the influence of individual estimators
on the final response of a committee. It would also be interesting to
investigate the use of such committees in other domains.

Abstract. Several works point out class imbalance as an obstacle to
applying machine learning algorithms to real world domains. However,
in some cases, learning algorithms perform well on several imbalanced
domains. Thus, it does not seem fair to directly correlate class imbalance
with the loss of performance of learning algorithms. In this work, we develop
a systematic study aiming to question whether class imbalances are truly
to blame for the loss of performance of learning systems or whether
the class imbalances are not a problem by themselves. Our experiments
suggest that the problem is not directly caused by class imbalances, but
is also related to the degree of overlapping among the classes.



1 Introduction 

Machine learning methods have advanced to the point where they can be applied
to real world problems, such as in data mining and knowledge discovery. As they
are applied to such problems, several new issues that had not previously been
considered by machine learning researchers are coming to light. One of these
issues is the class imbalance problem, i.e., the differences in class prior
probabilities. In real world machine learning applications, it has often been
reported that class imbalance hinders the performance of some standard
classifiers. However, the relationship between class imbalance and learning
algorithms is not yet clear, and a good understanding of how each one affects
the other is lacking. Although the performance of standard classifiers
decreases on many imbalanced domains, this does not mean that the imbalance is
solely responsible for the decrease. Rather, it is quite possible that, beyond
class imbalance, other conditions hamper classifier induction.

Our research is motivated by experiments we performed on some imbalanced
datasets, for instance the sick dataset [9], that provided good results
(99.65% AUC) even with a high degree of imbalance (only 6.50% of the examples
belong to the minority class). In addition, other research works seem to agree
with our standpoint [8].

In this work, we develop a systematic study aiming to question whether class
imbalances hinder classifier induction or whether these deficiencies might be
explained in other ways. To this end, we develop our study on a series of
artificial datasets. The idea behind using artificial datasets is to be able to
fully control all the variables we want to analyze. If we were not able to
control such variables, the results might be masked or difficult to understand
and interpret, at the risk of producing misleading conclusions. Our experiments
suggest that the problem is not solely caused by class imbalances, but is also
related to the degree of data overlapping among the classes.

This work is organized as follows: Section 2 introduces our hypothesis
regarding class imbalances and class overlapping. Section 3 presents some notes
related to evaluating classifier performance in imbalanced domains. Section 4
discusses our results. Finally, Section 5 presents some concluding remarks.



2 The Role of Class Imbalance on Learning 

In recent years, several works have been published in the machine learning
literature aiming to overcome the class imbalance problem [7,12]. There were
even two international workshops: the former was sponsored by AAAI [5] and
the latter was held together with the Twentieth International Conference on
Machine Learning [1]. There seems to be agreement in the Machine Learning
community that the imbalance between classes is the major obstacle to inducing
classifiers in imbalanced domains.

Conversely, we believe that class imbalances are not always the problem. To
illustrate our conjecture, consider the decision problem shown in Figure 1. The
problem is related to building a Bayes classifier for a simple single-attribute
problem whose instances should be classified into two classes, positive and
negative. Perfect knowledge regarding conditional probabilities and priors is
assumed. The conditional probabilities for the two classes are given by
Gaussian functions with the same standard deviation for each class, but with
the negative class mean one standard deviation (Figure 1(a)) and four standard
deviations (Figure 1(b)) apart from the positive class mean. The vertical lines
represent optimal Bayes splits.

From Figure 1, it is clear that the influence of changing priors on the
positive class, as indicated by the dashed lines, is stronger in Figure 1(a)
than in Figure 1(b). This indicates that it is not the class probabilities that
are mainly responsible for hindering the classification performance, but
instead the degree of overlapping between the classes. Thus, dealing with class
imbalances will not always help to improve classifier performance.
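
The argument can be checked numerically. For two Gaussian class-conditional
densities with equal standard deviation, equating the prior-weighted densities
gives a single optimal split located at the midpoint of the means plus a shift
of sigma² · ln(P(neg)/P(pos)) / (mu_pos − mu_neg). The sketch below assumes,
for illustration only, that the negative class is centred at 0 and the positive
class lies to its right; the function names are not from the paper.

```python
from math import erf, log, sqrt


def normal_cdf(z):
    """Cumulative distribution function of the standard normal distribution."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


def bayes_split(mu_neg, mu_pos, sigma, prior_pos):
    """Optimal threshold; instances above it are labelled positive (mu_pos > mu_neg)."""
    midpoint = (mu_neg + mu_pos) / 2.0
    shift = sigma ** 2 * log((1.0 - prior_pos) / prior_pos) / (mu_pos - mu_neg)
    return midpoint + shift


def true_positive_rate(mu_neg, mu_pos, sigma, prior_pos):
    """Fraction of the positive class that falls above the optimal split."""
    split = bayes_split(mu_neg, mu_pos, sigma, prior_pos)
    return 1.0 - normal_cdf((split - mu_pos) / sigma)
```

The shift is inversely proportional to the distance between the means: for a
positive prior of 1%, the split moves by ln(99), about 4.6 standard deviations,
when the means are one standard deviation apart, but only by about 1.15
standard deviations when they are four apart. In the first case almost no
positive instance lies above the split, whereas in the second most of them do.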



3 On Evaluating Classifiers in Imbalanced Domains 

The most straightforward way to evaluate classifier performance is based on
confusion matrix analysis. Table 1 illustrates a confusion matrix for a
two-class problem having class values positive and negative.

(a) High overlaid instances (b) Low overlaid instances

Fig. 1. A Simple Decision Problem

Table 1. Confusion matrix for a two-class problem.

                  Positive Prediction    Negative Prediction
Positive Class    True Positive (TP)     False Negative (FN)
Negative Class    False Positive (FP)    True Negative (TN)

From such a matrix it is possible to extract a number of widely used metrics
for measuring the performance of learning systems, such as the Classification
Error Rate, defined as $Err = \frac{FP + FN}{TP + FN + FP + TN}$, or,
equivalently, Accuracy, defined as
$Acc = \frac{TP + TN}{TP + FN + FP + TN} = 1 - Err$.

However, when the class prior probabilities are very different, the use of
such measures might lead to misleading conclusions. Error rate and accuracy
are particularly suspect as performance measures when studying the effect of
class distribution on learning, since they are strongly biased to favor the
majority class. For instance, it is straightforward to create a classifier
having an accuracy of 99% (or an error rate of 1%) in a domain where the
majority class corresponds to 99% of the instances, by simply forecasting every
new example as belonging to the majority class.

Another argument against the use of accuracy (or error rate) is that these
metrics consider different classification errors as equally important. However,
highly imbalanced problems generally have highly non-uniform error costs that
favor the minority class, which is often the class of primary interest. For
instance, a sick patient diagnosed as healthy might be a fatal error, while a
healthy patient diagnosed as sick is considered a much less serious error,
since this mistake can be corrected in future exams.

Finally, another point that should be considered when studying the effect of
class distribution on learning systems is that the class distribution may
change. Consider the confusion matrix shown in Table 1. Note that the class
distribution (the proportion of positive to negative instances) is the
relationship between the first and the second rows. Any performance metric that
uses values from both rows will be inherently sensitive to class skews. Metrics
such as accuracy and error rate use values from both rows of the confusion
matrix. As the class distribution changes, these measures will change as well,
even if the fundamental classifier performance does not.

All things considered, it is more interesting to use a performance metric that
dissociates the errors (or hits) that occur in each class. From Table 1 it is
possible to derive four performance metrics that directly measure the
classification performance on the positive and negative classes independently:

— False negative rate: $FN_{rate} = \frac{FN}{TP + FN}$ is the percentage of
  positive cases misclassified as belonging to the negative class;

— False positive rate: $FP_{rate} = \frac{FP}{FP + TN}$ is the percentage of
  negative cases misclassified as belonging to the positive class;

— True negative rate: $TN_{rate} = \frac{TN}{FP + TN}$ is the percentage of
  negative cases correctly classified as belonging to the negative class;

— True positive rate: $TP_{rate} = \frac{TP}{TP + FN}$ is the percentage of
  positive cases correctly classified as belonging to the positive class;
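
The metrics derived from Table 1 can be computed with a few lines of Python.
The sketch below is illustrative; the class name ConfusionMatrix and its method
names are assumptions of this example, and each rate requires at least one
instance of the class it is computed on.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionMatrix:
    tp: int  # positive cases predicted positive
    fn: int  # positive cases predicted negative
    fp: int  # negative cases predicted positive
    tn: int  # negative cases predicted negative

    @property
    def total(self):
        return self.tp + self.fn + self.fp + self.tn

    def error_rate(self):
        return (self.fp + self.fn) / self.total

    def accuracy(self):
        return (self.tp + self.tn) / self.total

    def fn_rate(self):
        return self.fn / (self.tp + self.fn)

    def fp_rate(self):
        return self.fp / (self.fp + self.tn)

    def tn_rate(self):
        return self.tn / (self.fp + self.tn)

    def tp_rate(self):
        return self.tp / (self.tp + self.fn)


# A classifier that labels everything as negative on 10 positive
# and 990 negative instances.
majority_only = ConfusionMatrix(tp=0, fn=10, fp=0, tn=990)
```

For majority_only the accuracy is 990/1000, yet the true positive rate is 0 and
the false negative rate is 1, which makes the bias of accuracy toward the
majority class explicit.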

These four performance measures have the advantage of being independent
of class costs and prior probabilities. The aim of a classifier is to minimize
the false positive and false negative rates or, similarly, to maximize the true
negative and true positive rates. Unfortunately, for most real world
applications, there is a tradeoff between $FN_{rate}$ and $FP_{rate}$ and,
similarly, between $TN_{rate}$ and $TP_{rate}$. ROC¹ graphs [10] can be used to
analyze the relationship between $FN_{rate}$ and $FP_{rate}$ (or $TN_{rate}$
and $TP_{rate}$) for a classifier.

A ROC graph characterizes the performance of a binary classification model
across all possible trade-offs between the classifier sensitivity ($TP_{rate}$)
and false alarm rate ($FP_{rate}$). ROC graphs are consistent for a given
problem even if the distribution of positive and negative instances is highly
skewed. A ROC analysis also allows the performance of multiple classification
functions to be visualized and compared simultaneously. A standard classifier
corresponds to a single point in the ROC space. Point (0, 0) represents
classifying all instances as negative, while point (1, 1) represents
classifying all instances as positive. The upper left point (0, 1) represents a
perfect classifier. One point in a ROC diagram dominates another if it is above
and to the left. If point A dominates point B, A outperforms B for all possible
class distributions and misclassification costs [2].

Some classifiers, such as the Naive Bayes classifier or some Neural Networks,
yield a score that represents the degree to which an instance is a member of
a class. Such a ranking can be used to produce several classifiers, by varying
the threshold above which an instance is assigned to a class. Each threshold
value produces a different point in the ROC space. These points are linked by
straight lines through consecutive points to produce a ROC curve². For Decision
Trees, we could use the class distributions at each leaf as scores or, as
proposed in [3], order the leaves by their positive class accuracy and produce
several trees by re-labelling the leaves, one at a time, from all forecasting
the negative class to all forecasting the positive class, in positive accuracy
order.

¹ ROC is an acronym for Receiver Operating Characteristic, a term used in signal
detection to characterize the tradeoff between hit rate and false alarm rate over a
noisy channel.

² Conceptually, we may imagine varying a threshold from −∞ to +∞ and tracing a
curve through the ROC space.

Fig. 2. Pictorial representation of some instances of the artificial datasets employed in
the experiments.

The area under the ROC curve (AUC) represents the expected performance
as a single scalar. The AUC has a known statistical meaning: it is equivalent
to the Wilcoxon test of ranks, and is equivalent to several other statistical
measures for evaluating classification and ranking models [4]. In this work, we
use the AUC as the main method for assessing our experiments. The results of
these experiments are shown in the next section.
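
The equivalence between the area under the ROC curve and the rank statistic can
be illustrated with the following sketch, which builds the ROC points by
sweeping a threshold over the scores and compares the trapezoidal area with the
fraction of positive/negative pairs that are ranked correctly. Labels are
assumed to be 1 (positive) and 0 (negative), both classes must be present, and
the function names are chosen for this example only.

```python
def roc_points(scores, labels):
    """ROC points (FP rate, TP rate) from sweeping a threshold over the scores."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        j = i
        # Instances with tied scores cross the threshold together.
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            tp += ranked[j][1]
            fp += 1 - ranked[j][1]
            j += 1
        points.append((fp / n_neg, tp / n_pos))
        i = j
    return points


def auc_trapezoid(points):
    """Area under the piecewise-linear curve through consecutive ROC points."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2.0
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


def auc_rank(scores, labels):
    """Probability that a random positive outscores a random negative (ties count 1/2)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Both functions return the same value on any scored sample, which is the sense
in which the AUC summarizes ranking quality independently of a particular
threshold.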

4 Experiments 

As the purpose of our study is to understand when class imbalances influence
the degradation of performance of learning algorithms, we run our experiments
on a series of artificial datasets whose characteristics we are able to
control, thus allowing us to fully interpret the results. This is not the case
when real datasets are used, as we stated before.

The artificial datasets employed in the experiments have two major controlled
parameters. The first one is the distance between the centroids of the two
clusters, and the second one is the degree of imbalance. The distance between
centroids lets us control the “level of difficulty” of correctly classifying
the two classes. The degree of imbalance lets us analyze whether imbalance is,
by itself, a factor that degrades performance.

The main idea behind our experiments is to analyze whether class imbalance, by
itself, can degrade the performance of learning systems. In order to perform
this analysis, we created several datasets. These datasets are composed of two
clusters: one representing the majority class and the other one representing
the minority class. Figure 2 presents a pictorial representation of four
possible instances of these datasets in a two-dimensional space.

We aim to answer several questions by analyzing the performance obtained on
these datasets. The main questions are:

— Is class imbalance a problem for learning systems, as stated in several
  research works? In other words, will a learning system present low
  performance with a highly imbalanced dataset even when the classes are far
  apart?

— Is the distance between the class clusters a factor that contributes to the
  poor performance of learning systems on an imbalanced dataset?

— Supposing that the distance between clusters matters in learning with
  imbalanced datasets, how can class imbalance influence the learning
  performance for a given distance between the two clusters?

The following section provides a more in-depth description of the approach we
used to generate the artificial datasets used in the experiments.



4.1 Experiments Setup 

To evaluate our hypothesis, we generated 10 artificial domains. Each artificial
domain is described by 5 attributes, and each attribute value is generated at
random, using a Gaussian distribution with standard deviation 1. Each domain
has 2 classes: positive and negative. For the first domain, the mean of
the Gaussian function for both classes is the same. For the following domains,
we stepwise add 1 standard deviation to the mean of the positive class, up to 9
standard deviations. For each domain, we generated 12 datasets. Each dataset
has 10,000 instances, but with different proportions of instances belonging to
each class: 1%, 2.5%, 5%, 10%, 15%, 20%, 25%, 30%, 35%, 40%, 45%, and 50% of
the instances in the positive class, and the remainder in the negative class.
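
A possible way to generate such datasets is sketched below. The text does not
say whether the mean shift is applied to every attribute or how the random
generator was seeded, so applying the shift to all 5 attributes, centring the
negative class at 0, and the seed handling are assumptions of this sketch, as
are the function names.

```python
import random

PROPORTIONS = [0.01, 0.025, 0.05, 0.10, 0.15, 0.20,
               0.25, 0.30, 0.35, 0.40, 0.45, 0.50]
DISTANCES = range(10)  # 0 to 9 standard deviations between class means


def make_dataset(distance, positive_fraction, n_instances=10_000,
                 n_attributes=5, seed=0):
    """Return a shuffled list of (attributes, label) with label 1 = positive."""
    rng = random.Random(seed)
    n_pos = round(n_instances * positive_fraction)
    rows = []
    for label, count, mean in ((1, n_pos, float(distance)),
                               (0, n_instances - n_pos, 0.0)):
        for _ in range(count):
            attributes = [rng.gauss(mean, 1.0) for _ in range(n_attributes)]
            rows.append((attributes, label))
    rng.shuffle(rows)
    return rows


def experiment_grid():
    """All (distance, positive fraction) pairs: 10 domains times 12 datasets."""
    return [(d, p) for d in DISTANCES for p in PROPORTIONS]
```

For example, make_dataset(1, 0.01) yields 100 positive and 9,900 negative
instances whose class means are one standard deviation apart in every
attribute.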

Although the class structure is quite simple (we generate datasets with
only two classes, and each class is grouped in only one cluster), this situation
is often faced by machine learning algorithms, since most of them, for
classification problems, follow the so-called separate-and-conquer strategy,
which recursively divides and solves smaller problems in order to induce the
whole concept. Furthermore, the Gaussian distribution can be used as an
approximation of several statistical distributions.

To run the experiments, we chose the algorithm for inducing decision trees
C4.5 [11]. C4.5 was chosen because it is quickly becoming the community
standard algorithm when evaluating learning algorithms in imbalanced domains.
All the experiments were evaluated using 10-fold cross validation. As discussed
in Section 3, we used the area under the ROC curve (AUC) as a quality measure.
We also implemented the method proposed in [3] to obtain the ROC curves and
the corresponding AUCs from the standard classifiers induced by C4.5.

4.2 Results 

The results obtained by applying C4.5 to the artificially generated datasets are
summarized in Table 2, which shows the mean AUC value and the respective
standard deviation, in parentheses, of the classifiers induced by C4.5 for all
the datasets having different class priors and different distances between the
positive and negative class centroids. We omitted the AUC values for the
datasets having a distance between class centroids greater than or equal to 4
standard deviations, since the results are quite similar to those for the
datasets having a distance of 3 standard deviations. Furthermore, for those
datasets the differences in AUC are statistically insignificant, at the 95%
confidence level, for any proportion of instances in each class. The results
for the datasets having class centroids 9 standard deviations apart are
included in order to illustrate the small variation between them and the
previous column.

Table 2. AUC obtained from classifiers induced by C4.5 varying class priors and class
overlapping

Positive     Distance of Class Centroids
instances    0                 1                 2                 3                 9
1%           50.00% (0.00%)    64.95% (9.13%)    90.87% (6.65%)    98.45% (2.44%)    99.99% (0.02%)
2.5%         50.00% (0.00%)    76.01% (6.41%)    95.82% (3.11%)    97.95% (2.12%)    99.99% (0.02%)
5%           50.00% (0.00%)    81.00% (2.86%)    98.25% (1.45%)    98.95% (1.11%)    100.00% (0.00%)
10%          50.00% (0.00%)    86.69% (2.11%)    98.22% (1.14%)    99.61% (0.55%)    99.99% (0.02%)
15%          50.00% (0.00%)    88.41% (2.37%)    98.92% (0.75%)    99.68% (0.49%)    99.99% (0.02%)
20%          50.00% (0.00%)    90.62% (1.44%)    99.08% (0.42%)    99.90% (0.21%)    99.99% (0.02%)
25%          50.00% (0.00%)    90.88% (1.18%)    99.33% (0.32%)    99.90% (0.14%)    99.98% (0.03%)
30%          50.00% (0.00%)    90.75% (0.81%)    99.24% (0.29%)    99.86% (0.14%)    99.99% (0.02%)
35%          50.00% (0.00%)    91.19% (0.94%)    99.36% (0.43%)    99.91% (0.08%)    99.99% (0.02%)
40%          50.00% (0.00%)    90.91% (0.99%)    99.46% (0.10%)    99.90% (0.13%)    99.99% (0.03%)
45%          50.00% (0.00%)    91.73% (0.79%)    99.44% (0.22%)    99.90% (0.09%)    99.98% (0.04%)
50%          50.00% (0.00%)    91.32% (0.68%)    99.33% (0.19%)    99.87% (0.13%)    99.99% (0.03%)

As expected, if both positive and negative classes have the same centroids, 
we have a constant AUC value of 50%, independently of class imbalance. This 
AUC value means that all examples are classified as belonging to the majority 
class. 

Consider the column where the centroids of each class are 1 standard deviation
apart. If this column is analyzed alone, one may infer that the degree of class
imbalance on its own is the main factor that influences the learning process.
The AUC has an upward trend, increasing from nearly 65% when the proportion of
instances of the positive class is 1% to more than 90% when the proportions of
positive and negative instances are the same. However, when the distance
between the class centroids goes up to 2 standard deviations, we can see that
the influence of the class priors becomes weaker. For instance, the AUC values
for the classifiers induced with the datasets having 1% and 2.5% of instances
in the positive class and the centroid of this class 2 standard deviations
apart from the centroid of the negative class are still worse than those for
the classifiers induced with other class distributions and the same centroids,
but the AUC values are closer together than the values for the same proportions
when the distance between the centroids is 1 standard deviation.

For classifiers induced with datasets whose class centroids are 3 or more standard 
deviations apart, the problem becomes quite trivial, and the AUC values are nearly 
100% regardless of the class distribution. 

For a better visualization of the overall trends, these results are shown graphically 
in Figures 3 and 4. These graphs show the behavior of the C4.5 algorithm, 
assessed by the AUC metric, under both class imbalance and class overlapping. 

Fig. 3. Variation in the proportion of positive instances versus AUC 



Figure 3 plots the percentage of positive instances in the datasets versus the 
AUC of the classifiers induced by C4.5 for different distances (in standard 
deviations) between the centroid of the positive class and that of the negative 
class. The curves for positive class centroids 3 to 8 standard deviations apart 
are omitted for better visualization, since they are quite similar to the curve 
for the centroid 9 standard deviations apart from the negative class. Consider the 
curves where the class centroids are 2 and 3 standard deviations apart. Both 
classifiers perform well, with AUC higher than 90%, even if the proportion of the 
positive class is barely 1%. In particular, the curve where the positive class 
centroid is 9 standard deviations from the negative class centroid represents an 
almost perfect classifier, independently of the class distribution. 

Figure 4 plots the variation of the distance between centroids versus the AUC of 
the classifiers induced by C4.5 for different class imbalances. The curves that 
represent proportions of positive instances between 20% and 45% are omitted 
for visualization purposes, since they are quite similar to the curve that 
represents an equal proportion of instances in each class. In this graph, we can see 
that the main degradation in classifier performance occurs when the 
difference between the centres of the positive and negative classes is 1 standard 
deviation. In this case, the degradation is significantly higher for highly imbalanced 
datasets, but it decreases as the distance between the centres of the positive and 
negative classes increases. The differences in classifier performance are 
statistically insignificant when the difference between the centres reaches 4 standard 
deviations, regardless of how many instances belong to the positive class. 

Fig. 4. Variation in the centre of positive class versus AUC 



Analyzing the results, it is possible to see that class overlapping has an 
important role in concept induction, even stronger than class imbalance. 
Those trends seem to validate our earlier hypothesis, presented in Section 2. 

5 Conclusion and Future Work 

Class imbalance is often reported as an obstacle to the induction of good clas- 
sifiers by machine learning algorithms. However, for some domains, machine 
learning algorithms are able to achieve meaningful results even in the presence 
of highly imbalanced datasets. In this work, we develop a systematic study using 
a set of artificially generated datasets aiming to show that the degree of class 
overlapping has a strong correlation with class imbalance. This correlation, to 
the best of our knowledge, has not been previously analyzed elsewhere in the 
machine learning literature. A good understanding of this correlation would be 
useful in the analysis and development of tools to treat imbalanced data or in 
the (re)design of learning algorithms for practical applications. 

In order to study this question in more depth, several further approaches can 
be taken. For instance, it would be interesting to vary the standard deviations of 
the Gaussian functions that generate the artificial datasets. It is also worthwhile 
to consider the generation of datasets where the distribution of instances of the 
minority class is separated into several small clusters. This approach can lead to 
the study of the class imbalance problem together with the small disjunct problem, 
as proposed in [6]. Another point to explore is the analysis of the ROC curves 
obtained from the classifiers. This approach might produce some useful insights for 
developing or analyzing methods for dealing with class imbalance. Last but not 
least, experiments should also be conducted on real-world datasets in order to 
verify that the hypothesis presented in this work does apply to them. 
Abstract. MEDLINE is a representative collection of medical documents 
supplied with original full-text natural-language abstracts as well as with 
representative keywords (called MeSH terms) manually selected by expert 
annotators from a pre-defined ontology and structured according to their 
relation to the document. We show how the structured, manually assigned 
semantic descriptions can be combined with the original full-text abstracts to 
improve the quality of clustering the documents into a small number of clusters. As 
a baseline, we compare our results with clustering using only abstracts or only 
MeSH terms. Our experiments show 36% to 47% higher cluster coherence, as 
well as more refined keywords for the produced clusters. 



1 Introduction 

As science and technology continue to advance, the number of related documents 
rapidly increases. Today, the amount of stored documents is massive, and 
information retrieval over such large document collections has become an active field 
of study and research. 

The MEDLINE database, maintained by the National Library of Medicine 
(http://www.nlm.gov/), contains ca. 12 million abstracts on biology and medicine 
collected from 4,600 international biomedical journals, stored and managed in 
Extensible Markup Language (XML) format. MeSH (Medical Subject Headings) terms 
are manually added to each abstract to describe its content for indexing. Currently, 
MEDLINE supports various types of search queries. 

However, query-based search is of limited use for searching biological 
abstracts. The query-based search method may be appropriate for 
content-focused querying, but only on the condition that the user is an expert 
in the subject and can choose the keywords for the items they are searching for. Thus 
it is very confusing and time-consuming for users who are not experts in biology or 
medicine to search using queries, because new techniques and theories pour out 
continuously in this particular field. Even expert users have trouble when they need 
to quickly find specific information, because it is impossible to read every retrieved 
document from beginning to end. We see that the query-based search method is 
not efficient in either case [1]. 

Unlike the general query-based search method, document clustering automatically 
gathers highly related documents into groups. It is a very important technique now 
that efficient searching must be applied to massive amounts of documents. Generally, 
an unsupervised machine learning method [2] is applied for clustering: the features 
of each instance are provided as input to the clustering algorithm in order to 
group highly similar documents into a cluster. 

In recent years, interest in processing biological documents has increased, but 
it has mostly focused on the detection and extraction of relations [3] [4] and the 
detection of keywords [5] [6]. Up to now, few studies on the clustering of biological 
documents have been done, except for the development of TextQuest [1], which uses 
a method of reducing the number of terms based on the abstract of the document. 
Namely, this simple method uses a cut-off threshold to eliminate infrequent terms 
and then feeds the result into an existing clustering algorithm. 

The present study proposes an improved clustering technique that uses the 
semantic information of the MEDLINE documents, expressed in XML. In the 
proposed method the XML tags are used to extract the so-called MeSH, the terms that 
represent the document, and additional term weights are given to the MeSH terms in 
the input of a clustering algorithm. The greater the additional term weight given to 
the MeSH, the greater the role of the MeSH terms in forming clusters. This way the 
influence of the MeSH on the clusters can be controlled by adjusting the value of the 
additional term weighting parameter. 

For this study, the vector space model [7] with the cosine measure was used to 
calculate the similarity between the documents. The formula for calculating the 
coherence of the cluster was applied to evaluate the quality of the formed clusters, and 
the cluster key words were extracted based on the concept vector [8] for summarizing 
the cluster’s contents. 



2 Basic Concepts 

This section will present the theoretical background: the vector space model, which is 
the basis for document searching, the measurement of coherence for evaluating the 
quality of the clusters, and the extraction of keywords that represent the cluster. 



2.1 Vector Space Model 

The main idea of the vector space model [7] is to express each document by the 
vector of its weighted term frequencies. The following procedure is used for 
expressing documents by vectors [9]: 

• Extract all the terms from the entire document collection. 

• Exclude the terms without semantic meaning (called stopwords). 

• Calculate the frequency of terms for each document. 

• Treat terms with very high or low frequency as stopwords. 

• Allocate indexes from 1 to $d$ to each remaining term, where $d$ is the number of 
such terms. Allocate indexes from 1 to $n$ to each document. Then the vector space 
model for the entire group of documents is determined by the $d \times n$-dimensional 
matrix $w = [w_{ij}]$, where $w_{ij}$ refers to the tf-idf (term frequency-inverse 
document frequency) value of the $i$-th term in the $j$-th document. 
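
The procedure above can be sketched as follows. The text does not state which tf-idf variant is used, so the weighting `tf * log(n / df)` is an assumption, as are the function name `tfidf_matrix`, the whitespace tokenizer, and the toy documents; the frequency-threshold step is left out for brevity.

```python
import math
from collections import Counter

def tfidf_matrix(docs, stopwords=frozenset()):
    """Return (terms, w), where w[i][j] is the tf-idf weight of term i in document j."""
    # Tokenize and drop stopwords (a plain whitespace split; an assumption).
    tokenized = [[t for t in doc.lower().split() if t not in stopwords]
                 for doc in docs]
    # Index the remaining terms; d = len(terms), n = len(docs).
    terms = sorted({t for doc in tokenized for t in doc})
    n = len(docs)
    counts = [Counter(doc) for doc in tokenized]
    # Document frequency: in how many documents each term occurs.
    df = {t: sum(1 for c in counts if t in c) for t in terms}
    # d x n matrix; a term absent from a document has weight 0.
    w = [[counts[j][t] * math.log(n / df[t]) for j in range(n)] for t in terms]
    return terms, w

docs = ["heart failure in rats", "heart surgery patients", "dna binding protein"]
terms, w = tfidf_matrix(docs, stopwords={"in"})
for term, row in zip(terms, w):
    print(term, [round(x, 3) for x in row])
```

Each row of `w` corresponds to a term and each column to a document, matching the $d \times n$ layout described above.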



Measuring the similarity between two documents in the process of clustering is as 
important as the selection of the clustering algorithm. Here we measure the similarity 
between two documents by the cosine expression widely used in information retrieval 
and text mining, because it is easy to understand and the calculation for sparse vectors 
is very simple [9] . 

In this study, each document vector was normalized to unit $L_2$ norm. The 
normalization retains only the direction of the term vector, so that documents on 
the same subject (i.e., those composed of similar terms) are converted to 
similar document vectors. With this, the cosine similarity between the document 
vectors $x_i$ and $x_j$ can be derived from the inner product of the two vectors: 

$$ s(x_i, x_j) = x_i^T x_j = \|x_i\| \, \|x_j\| \cos(\theta(x_i, x_j)) = \cos(\theta(x_i, x_j)) $$

The angle formed by the two vectors satisfies $0 \le \theta(x_i, x_j) \le \pi/2$. 

When $n$ document vectors are distributed into $k$ disjoint clusters 
$\pi_1, \pi_2, \ldots, \pi_k$, the mean vector, or centroid, of the cluster $\pi_j$ is 

$$ m_j = \frac{1}{n_j} \sum_{x \in \pi_j} x, \tag{1} $$

where $n_j$ is the number of document vectors in $\pi_j$. Then, when the mean vector 
$m_j$ is normalized to have a unit norm, the concept vector $c_j$ can be defined to 
possess the direction of the mean vector [8]: 

$$ c_j = \frac{m_j}{\|m_j\|}. \tag{2} $$

The concept vector $c_j$ defined in this way has some important properties, such as 
the following consequence of the Cauchy-Schwarz inequality, which holds for any 
unit vector $z$: 

$$ \sum_{x \in \pi_j} x^T z \le \sum_{x \in \pi_j} x^T c_j. \tag{3} $$

From the inequality above, it follows that, among all unit vectors, the concept 
vector $c_j$ has the greatest total cosine similarity to the document vectors 
belonging to cluster $\pi_j$. 
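
A minimal numeric illustration of the concept vector and of property (3) follows. The helper names (`dot`, `normalize`, `concept_vector`) and the toy vectors are assumptions made for this sketch only.

```python
import math

def dot(u, v):
    """Inner product; equals the cosine when both vectors have unit L2 norm."""
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    """Scale a non-zero vector to unit L2 norm."""
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]

def concept_vector(cluster):
    """Normalized mean (centroid) of a list of unit document vectors."""
    n_j = len(cluster)
    mean = [sum(col) / n_j for col in zip(*cluster)]
    return normalize(mean)

# Three toy documents over three terms, normalized to unit length.
cluster = [normalize(v) for v in ([3.0, 1.0, 0.0], [2.0, 2.0, 0.0], [1.0, 0.0, 1.0])]
c = concept_vector(cluster)

# Any other unit vector z has a total cosine similarity no larger than that of c.
z = normalize([1.0, 0.0, 0.0])
lhs = sum(dot(x, z) for x in cluster)
rhs = sum(dot(x, c) for x in cluster)
print(lhs <= rhs)
```

The comparison holds for every unit vector `z`, not only the one chosen here, because the sum of the inner products equals the inner product of `z` with the sum of the documents.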



2.2 Cluster Quality Evaluation Model 



As a measure of quality of the obtained clusters we use the coherence of a cluster 
$\pi_j$ ($1 \le j \le k$), which, based on formula (3), can be measured as the average, 
over the $n_j$ documents of the cluster, of their cosine similarity to the concept 
vector: 

$$ \frac{1}{n_j} \sum_{x \in \pi_j} x^T c_j. \tag{4} $$

If the vectors of all the documents in a cluster are equal, its coherence is 1, the highest 
possible value. On the other hand, if the document vectors in a cluster are spread 
extremely wide apart from each other, the average coherence would be close to 0. The 
coherence of the cluster system $\pi_1, \pi_2, \ldots, \pi_k$ is measured using the 
objective function shown below [8]: 

$$ Q(\{\pi_j\}_{j=1}^{k}) = \sum_{j=1}^{k} \sum_{x_i \in \pi_j} x_i^T c_j. \tag{5} $$

The coherence measure can be successfully used when the number of clusters is 
fixed, as it is in our case (with a variable number of clusters it favors a large number 
of small clusters). 
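
The two measures can be sketched as below, assuming that the coherence of a cluster is the average cosine similarity between its unit document vectors and its concept vector, and that the objective function sums these similarities over all clusters. The function names and the two-term toy vectors are illustrative assumptions.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]

def coherence(cluster):
    """Average cosine similarity of the documents to the cluster's concept vector."""
    # The normalized sum has the same direction as the normalized mean.
    c = normalize([sum(col) for col in zip(*cluster)])
    return sum(dot(x, c) for x in cluster) / len(cluster)

def objective(clusters):
    """Sum over all clusters and documents of the cosine to the concept vector."""
    return sum(coherence(cl) * len(cl) for cl in clusters)

e1, e2 = [1.0, 0.0], [0.0, 1.0]
print(coherence([e1, e1]))           # identical documents
print(coherence([e1, e2]))           # documents sharing no terms
print(objective([[e1, e1], [e1, e2]]))
```

Because every document contributes a value of at most 1, the objective grows with the number of documents, so it is meaningful to compare it only across clusterings of the same collection.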



2.3 Extracting Cluster Summaries 

To present the clusters to the user, we provide cluster summaries, so that the user 
can more easily understand the contents of the documents in each cluster and select 
the cluster of interest by examining only these summaries. As summaries we use 
representative keyword sets. 

Given $n$ document vectors divided into $k$ disjoint clusters 
$\pi_1, \pi_2, \ldots, \pi_k$, denote the keywords of the document cluster $\pi_j$ by 
$words_j$. A term is included in the summary $words_j$ if its term weight in the 
concept vector of cluster $\pi_j$ is greater than its term weight in the concept 
vectors of the other clusters [8]: 

$$ words_j = \{\, i\text{-th word} : 1 \le i \le d,\; c_{ij} > c_{im},\; 1 \le m \le k,\; m \ne j \,\}, \tag{6} $$

where $d$ is the total number of terms and $c_{ij}$ is the weight of the $i$-th term in 
the concept vector of the $j$-th cluster. Such a summary $words_j$ consists of the 
terms on which the concept vector is localized to the matching cluster. Because of 
this, it gives comparatively good keywords. 
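
The selection rule can be sketched as follows: a term is kept for a cluster when its weight in that cluster's concept vector is strictly larger than in every other concept vector, and the kept terms are then ranked by weight. The function name `cluster_keywords`, the `top` parameter, and the toy weights (not normalized, for readability) are assumptions.

```python
def cluster_keywords(terms, concept_vectors, j, top=15):
    """Terms whose weight in concept vector j exceeds their weight in all others."""
    cj = concept_vectors[j]
    others = [c for m, c in enumerate(concept_vectors) if m != j]
    kept = [(cj[i], terms[i]) for i in range(len(terms))
            if all(cj[i] > c[i] for c in others)]
    kept.sort(key=lambda pair: -pair[0])      # heaviest terms first
    return [term for _, term in kept[:top]]

terms = ["heart", "dna", "protein", "care"]
concept_vectors = [
    [0.8, 0.1, 0.2, 0.0],   # cluster 0
    [0.1, 0.7, 0.6, 0.1],   # cluster 1
    [0.0, 0.0, 0.1, 0.9],   # cluster 2
]
for j in range(len(concept_vectors)):
    print(j, cluster_keywords(terms, concept_vectors, j))
```

With a strict comparison, a term whose largest weight is tied between two clusters is assigned to neither; a non-strict comparison would assign it to both.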



3 Term Weighting 

This section presents our clustering technique, which combines the 
structured semantic information assigned to the MEDLINE documents with their 
full-text information. 



3.1 Medical Subject Headings (MeSH) 

MeSH is an extensive list of medical terminology. It has a well-formed hierarchical 
structure. MeSH includes major categories such as anatomy/body systems, organisms, 
diseases, chemicals and drugs and medical equipment. Expert annotators of the 
National Library of Medicine databases, based on indexed content of documents, 
assign subject headings to each document for the users to be able to effectively 
retrieve the information that explains the same concept with different terminology. 

MeSH terms are subdivided into Major MeSH headings and MeSH headings. 
Major MeSH headings are used to describe the primary content of the document, 
while MeSH headings are used to describe its secondary content. On average, 5 to 15 
subject headings are assigned per document, 3 to 4 of them being major headings. 
The MeSH annotation scheme also uses subheadings that limit or qualify a subject 
heading. Subheadings are likewise divided into major subheadings and subheadings 
according to the importance of the corresponding terms. 



3.2 Extraction of MeSH Terms from MEDLINE 

MEDLINE annotated documents are represented as XML documents, whose tags carry 
meaning. The MEDLINE documents conform to a Document Type Definition (DTD) that 
works as the document schema, namely, the NLM MEDLINE DTD 
(www.nlm.nih.gov/database/dtd/nlmcommon.dtd). In the DTD, information on the MeSH 
is expressed with the <MeshHeadingList> tag as shown in Fig. 1. Fig. 2 shows the 
MeSH expressed by using the DTD of Fig. 1. 

<!ELEMENT MeshHeadingList (MeshHeading+)> 

<!ELEMENT MeshHeading (Descriptor, QualifierName* )> 

<!ELEMENT Descriptor (#PCDATA)> 

<!ATTLIST Descriptor MajorTopicYN (Y | N) "N"> 

<!ELEMENT QualifierName (#PCDATA)> 

<!ATTLIST QualifierName MajorTopicYN (Y | N) "N"> 

Fig. 1. NLM MEDLINE DTD (MeSH) 



<MeshHeadingList> 
<MeshHeading><Descriptor MajorTopicYN="Y">Acetabulum</Descriptor></MeshHeading> 
<MeshHeading><Descriptor MajorTopicYN="N">Adolescence</Descriptor></MeshHeading> 
<MeshHeading><Descriptor MajorTopicYN="N">Adult</Descriptor></MeshHeading> 
<MeshHeading><Descriptor MajorTopicYN="N">Aged</Descriptor></MeshHeading> 
<MeshHeading><Descriptor MajorTopicYN="N">Arthroscopy</Descriptor></MeshHeading> 
<MeshHeading><Descriptor MajorTopicYN="N">Cartilage, Articular</Descriptor> 
<QualifierName MajorTopicYN="Y">injuries</QualifierName> 
<QualifierName MajorTopicYN="N">surgery</QualifierName></MeshHeading> 
<MeshHeading><Descriptor MajorTopicYN="N">Case Report</Descriptor></MeshHeading> 
<MeshHeading><Descriptor MajorTopicYN="N">Female</Descriptor></MeshHeading> 
<MeshHeading><Descriptor MajorTopicYN="N">Human</Descriptor></MeshHeading> 
<MeshHeading><Descriptor MajorTopicYN="N">Male</Descriptor></MeshHeading> 
<MeshHeading><Descriptor>Rupture</Descriptor></MeshHeading> 
<MeshHeading><Descriptor MajorTopicYN="Y">Surgical Procedures, Endoscopic</Descriptor> 
</MeshHeading> 
</MeshHeadingList> 

Fig. 2. MEDLINE Data Sample (MeSH) 

As observed in Fig. 1 and Fig. 2, the information on MeSH in the DTD can be 
classified into two types, each of which can in turn be divided into two more 
specific types. The first type refers to the terms appearing under the tag <Descriptor> 
with the MajorTopicYN attribute Y or N, which indicates that the terms appearing 
under the tag are major MeSH or MeSH, respectively. The second type refers to terms 
appearing under the tag <QualifierName> with the MajorTopicYN attribute Y or N, 
which indicates that the terms appearing under the tag are major subheadings or 
subheadings, respectively. 

In order to extract the four types of terms described above, the XML document is 
parsed and the DOM (www.w3.org/DOM) tree is produced. The terms extracted in 
this process are used to adjust the term weights in the document vector. 
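
The extraction step can be sketched with the DOM implementation in the Python standard library. This is an illustration, not the parser used in the study: the function name `extract_mesh` is hypothetical, and the sample is a shortened version of the data shown in Fig. 2.

```python
from xml.dom.minidom import parseString

SAMPLE = """<MeshHeadingList>
<MeshHeading><Descriptor MajorTopicYN="Y">Acetabulum</Descriptor></MeshHeading>
<MeshHeading><Descriptor MajorTopicYN="N">Cartilage, Articular</Descriptor>
<QualifierName MajorTopicYN="Y">injuries</QualifierName>
<QualifierName MajorTopicYN="N">surgery</QualifierName></MeshHeading>
<MeshHeading><Descriptor>Rupture</Descriptor></MeshHeading>
</MeshHeadingList>"""

def extract_mesh(xml_text):
    """Group MeSH terms into the four types: (tag, MajorTopicYN) -> list of terms."""
    dom = parseString(xml_text)
    groups = {}
    for tag in ("Descriptor", "QualifierName"):
        for node in dom.getElementsByTagName(tag):
            # The DTD declares "N" as the default value. minidom does not read
            # the DTD, so the default is applied by hand when the attribute is absent.
            major = node.getAttribute("MajorTopicYN") or "N"
            text = "".join(child.data for child in node.childNodes
                           if child.nodeType == child.TEXT_NODE)
            groups.setdefault((tag, major), []).append(text.strip())
    return groups

groups = extract_mesh(SAMPLE)
for key in sorted(groups):
    print(key, groups[key])
```

The four keys of the returned dictionary correspond to major MeSH, MeSH, major subheadings, and subheadings.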

3.3 Combining Abstract Terms with MeSH Terms 

MEDLINE documents include both the full-text abstract taken from the original 
document (a scientific article published in a journal) and the MeSH terms. The 
MeSH terms represent important information about the document, since they are 
assigned by experts with great care and consideration. Thus it can be expected (and 
our experiments confirmed this) that clustering the documents according to their 
MeSH information would give better clusters than clustering using the original 
abstracts. 

However, combining the two types of information can lead to even better results. 
We combine the two sources of information by adjusting the term weights in the 
document vectors used as feature weights by the clustering algorithm. 

Indeed, clustering results are highly influenced by the term weights assigned to 
individual terms; for better results, more important terms are to be assigned greater 
weight. Since MeSH terms are known to be more important, the original term weights 
for the words appearing under the corresponding headings are adjusted as follows: 

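$$
w_{ij} \leftarrow
\begin{cases}
w_{ij} + \left(p + \dfrac{p}{4 + \ln(p)}\right), & \text{<Descriptor MajorTopicYN="Y">} \\
w_{ij} + \left(p + \dfrac{p}{2 \times (4 + \ln(p))}\right), & \text{<Descriptor MajorTopicYN="N">} \\
w_{ij} + \left(p - \dfrac{p}{2 \times (4 + \ln(p))}\right), & \text{<SubHeading MajorTopicYN="Y">} \\
w_{ij} + \left(p - \dfrac{p}{4 + \ln(p)}\right), & \text{<SubHeading MajorTopicYN="N">}
\end{cases}
\tag{7}
$$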

In this formula, a coefficient p is used to control the degree to which the abstracts 
or the MeSH terms participate in the resulting weighting. With p close to 0, the MeSH 
terms do not receive any special treatment, so that the results are close to those of 
clustering using only abstracts. With very large p, the results are close to those of 
clustering using only MeSH terms. With intermediate values of p, the two types of 
information are combined. Our experimental results show that the best clusters are 
obtained with some intermediate values of p. 

The specific expressions in formula (7) were found empirically in such a way that 
the formula gives slightly different additional values to the terms according to their 
significance: a fraction of the value of p (about 33% or 16% when p is around 0.3) 
is added to or subtracted from p itself. For example, when p = 0.3, the additional 
term weights are 0.3 + 0.107, 0.3 + 0.054, 0.3 - 0.054, and 0.3 - 0.107, respectively. 
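
The worked example for p = 0.3 can be reproduced with a few lines. The sketch reads the correction term of formula (7) as p / (4 + ln p), halved for the two middle cases, which is the reading that yields the values 0.107 and 0.054 quoted above; the function name `additional_weight` and its arguments are illustrative assumptions.

```python
import math

def additional_weight(p, tag, major):
    """Amount added to w_ij for a MeSH term, following the four cases of formula (7)."""
    # The correction is defined only for p > 0 and p != exp(-4).
    delta = p / (4.0 + math.log(p))
    if tag == "Descriptor":
        return p + (delta if major == "Y" else delta / 2.0)
    # Subheadings receive slightly less than p.
    return p - (delta / 2.0 if major == "Y" else delta)

cases = [("Descriptor", "Y"), ("Descriptor", "N"),
         ("SubHeading", "Y"), ("SubHeading", "N")]
for tag, major in cases:
    print(tag, major, round(additional_weight(0.3, tag, major), 3))
```

The ratio of the correction to p is 1 / (4 + ln p), so it shrinks as p grows: the four types of terms are treated almost alike for large p.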

After the term weights are modulated by the above formula, they are re-normalized, 
since the previously normalized values have changed; see Fig. 3. 

Fig. 3. The Main Algorithm. 



4 Experimental Results 

For the experiment, 4,872 documents were extracted from the MEDLINE edition 
published in 1996. Two groups of documents were formed: one group contained 
documents with abstracts only and the other those with both abstracts and MeSH 
terms. 

The MC program [10], which produces vectors from a given group of documents, 
was used to vectorize the documents. Stop words and the terms with frequency lower 
than 0.5% or higher than 15% were excluded. With this, the document group with 
abstracts only had 2,792 terms remaining and that with both abstract and MeSH, 
3,021. Then the tf-idf value was calculated for each of the document groups and 
normalized to unit $L_2$ norm to form 4,872 document vectors. 

To verify the proposed method, the MeSH terms were extracted from the document 
group with both abstract and MeSH, and then the term weights of the corresponding 
terms of each document vector were modulated by formula (7). Then they were 
normalized to unit $L_2$ norm. 

The standard spherical k-means algorithm was implemented for testing these data. 
This is an efficient clustering algorithm that quickly produces the fixed number of 
clusters specified by the user. 
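
For readers unfamiliar with the algorithm, a compact sketch of spherical k-means is given below. It is not the implementation used in the experiments: the seeding with the first k documents, the iteration limit, and all names are assumptions chosen to keep the example short and deterministic.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v] if norm > 0 else list(v)

def spherical_kmeans(docs, k, iterations=20):
    """Cluster unit vectors by cosine similarity; returns (assignments, concept vectors)."""
    concepts = [list(d) for d in docs[:k]]        # naive seeding: first k documents
    assign = [0] * len(docs)
    for _ in range(iterations):
        # Assignment step: each document joins the most similar concept vector.
        new_assign = [max(range(k), key=lambda j: dot(x, concepts[j])) for x in docs]
        # Update step: concept vector = normalized sum of the member documents.
        for j in range(k):
            members = [x for x, a in zip(docs, new_assign) if a == j]
            if members:                           # keep the old vector if a cluster empties
                concepts[j] = normalize([sum(col) for col in zip(*members)])
        if new_assign == assign:                  # no document changed cluster
            break
        assign = new_assign
    return assign, concepts

# Four toy documents over two terms: two "about" term 0, two "about" term 1.
docs = [normalize(v) for v in ([1.0, 0.1], [0.1, 1.0], [0.9, 0.2], [0.2, 0.9])]
assign, concepts = spherical_kmeans(docs, k=2)
print(assign)
```

The only difference from ordinary k-means is that similarity is measured by the inner product of unit vectors and that the centroids are re-normalized after each update.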

We clustered our test document set into a fixed number of clusters, from 3 to 6 
(these numbers are of major interest for user interfaces providing document 
collection navigation support), using different values of the parameter p, varying 
smoothly from abstract-only to MeSH-only strategies. The abstract-only strategy was 
used as a baseline. The quality of the clusters was measured as cluster coherence, as 
explained in Section 2.2. The test results are shown in Table 1, where the gain ratio 
is calculated relative to the abstract-only strategy. The best values in each column are 
emphasized. Fig. 4 shows the average coherence (over the different numbers of 
clusters; the rightmost column of Table 1) obtained with different values of the 
parameter p. 

As can be observed in the figure, there is a wide range of values of p providing 
near-optimal values of coherence. In particular, such values are 36 to 47% better 
than those obtained with abstracts only and 10 to 15% better than with MeSH terms 
only. This justifies our idea of combining these two sources of information in a 
non-trivial manner. 
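
The gain figures are plain relative improvements. The sketch below recomputes them for N = 3, using the abstracts-only and MeSH-only values from Table 1 and the best coherence in that column; the function name `gain` is an assumption.

```python
def gain(coherence, baseline):
    """Relative improvement over a baseline, in percent."""
    return 100.0 * (coherence - baseline) / baseline

# Coherence values for N = 3, taken from Table 1.
abstracts_only, mesh_only, best_combined = 1065, 1361, 1570

print(round(gain(best_combined, abstracts_only), 1))   # gain over abstracts only
print(round(gain(best_combined, mesh_only), 1))        # gain over MeSH only
```

The same computation, applied to the other columns, gives the ranges quoted in the text.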



Table 1. Experimental Results. 

N stands for the number of clusters, C for average coherence of the obtained clusters, and R for 
gain rate relative to the abstract-only clustering. The row marked with * contains the best 
value of every column. 

                      N = 3           N = 4           N = 5           N = 6          Average
p                    C      R        C      R        C      R        C      R        C      R
Abstracts only     1065   0%       1126   0%       1157   0%       1234   0%       1146   0%
p = 0              1185  11.3%     1217   8.1%     1280  10.6%     1327   7.5%     1252   9.3%
p = 1              1480  39.0%     1561  38.6%     1619  39.9%     1679  36.1%     1585  38.3%
p = [illegible]    1535  44.1%     1560  38.5%     1650  42.6%     1717  39.1%     1616  41.0%
p = 10             1557  46.2%     1631  44.8%     1684  45.5%     1720  39.4%     1648  43.8%
p = 20             1560  46.5%     1616  43.5%     1692  46.2%     1716  39.1%     1646  43.7%
p = 100 *          1570  47.4%     1641  45.7%     1699  46.8%     1746  41.5%     1664  45.3%
p = 500            1550  45.5%     1619  43.8%     1685  45.6%     1743  41.2%     1649  44.0%
p = 1000           1558  46.3%     1585  40.8%     1664  43.8%     1738  40.8%     1636  42.8%
p = 2000           1563  46.8%     1601  42.2%     1688  45.9%     1655  34.1%     1627  42.0%
MeSH only          1361  27.8%     1422  26.3%     1535  32.7%     1561  26.5%     1470  28.3%






Fig. 4. Experimental Results: Average coherence as function of the parameter p. 



As additional evidence of the improvement in cluster quality, we extracted the 
keyword summaries from the clusters produced with the baseline procedure and with 
our method, respectively. For this, we used formula (6). Table 2 presents the 
summaries for the five clusters obtained using abstracts only and Table 3 those for 
the clusters obtained with p = 100, which gives the best clusters according to the 
coherence measure. The fifteen key words with the greatest term weight were extracted 
from each of the five clusters. 

One can easily observe that the keyword summary is more consistent and natural 
for the clusters obtained with a non-trivial value of the parameter p. 

Table 2. Keyword summaries for clustering based only on abstracts. 

Cluster Key Words 

1 cells, protein, cell, dna, proteins, expression, receptor, gene, beta, binding, alpha, 
human, acid, activity, kinase 

2 care, health, medical, united, states, usa, medicine, apr, management, din, research, 
cancer, nursing, dis, nurse 

3 hiv, mice, virus, peptide, infected, california, muscle, model, signaling, mucosal, wild, 
differentiation, class, produced, signal 

4 patients, group, treatment, clinical, cases, trial, study, patient, disease, surgery, risk, 
years, age, hospital, children 

5 heart, coronary, rats, cardiac, ventricular, cardiology, artery, aim, myocardial, pressure, 
failure, atrial, flow, exercise, blood 



Table 3. Keyword summaries for clustering by abstracts and MeSH terms (p = 100). 

Cluster Key Words 

1 cell, cells, mice, pathology, expression, growth, factor, tumor, cultured, kinase, induced, 
antigens, hiv, membrane, beta 

2 human, health, care, united, states, medical, nursing, drug, agents, medicine, disease, 
research, patient, therapy, hospital 

3 diagnosis, female, male, age, case, adult, middle, aged, radiography, diseases, heart, 
child, neoplasms, ultrasonography, adolescence 

4 animal, rats, support, receptors, chemical, effects, brain, activity, wistar, sprague, 
dawley, inhibitors, antagonists, muscle, rat 

5 sequence, dna, proteins, protein, molecular, acid, amino, binding, data, gene, structure, 
base, genetic, recombinant, genes 



5 Conclusion 

Medical documents in the MEDLINE database contain both original full-text 
natural-language abstracts and structured keywords manually assigned by expert 
annotators. We have shown that a combination of these two sources of information 
provides better features for clustering the documents than each of the two sources 
independently. We have shown one possible way to combine them, depending on a 
parameter that determines the degree of contribution of each source. Namely, we 
combined them by adjusting the term weights of the corresponding terms in the 
document vectors, which were then used as features by a standard clustering 
algorithm. 

Our experimental results show that there exists a wide range of values of the 
parameter that gives a stable improvement in the quality of the obtained clusters, both 
in the internal coherence of the obtained clusters and in the consistency of their 
keyword summaries. This implies, in particular, that our method is not sensitive to 
the specific choice of the parameter. 

The improvement observed in the experiments was 36 to 47% in comparison with 
taking into account abstracts only and 10 to 15% in comparison with taking into 
account MeSH terms only. 
Abstract. Fuzzy (valued) preference relations (FPR) make it possible to take 
into account the intensity of preference between alternatives. The refinement of 
crisp (non-valued) preference relations by replacing them with valued 
preference relations often transforms crisp preference relations with cycles into 
acyclic FPR. This makes it possible to make decisions in situations where crisp 
models do not work. Different models of rationality of strict FPR defined by the 
levels of transitivity or acyclicity of these relations are considered. The choice 
of the best alternatives based on a given strict FPR is defined by a fuzzy choice 
function (FCF) ordering the alternatives in a given subset of alternatives. The 
relationships between the rationality of strict FPR and the rationality of FCF are 
studied. Several valued generalizations of crisp group decision-making 
procedures are proposed. As shown by examples of group decision-making in 
multiagent systems, taking into account the preference values makes it possible 
to avoid some problems typical of crisp procedures. 



1 Introduction 

The problem of decision-making (DM) may be considered as the problem of ranking 
the elements of some set of alternatives X or looking for the “best” alternatives from 
this set [6, 9]. Different approaches to these problems vary in the structure of 
the set X, in the initial information about its elements, in the criteria used for 
ranking and evaluation of the “best” alternatives, etc. Most intelligent systems 
include some DM procedures as a component; e.g., crisp and valued preference 
relations are used for modeling decision making in multiagent systems [5, 11, 12]. 

Valued (fuzzy) preference relations (FPR) make it possible to take into account the 
intensity of preference between alternatives. Different models of DM based on FPR 
have been considered in the literature [1-3, 5-8, 10-12]. These models are usually based 
on a weak fuzzy preference relation $R: X \times X \to L$ defined on the set of 
alternatives X such that for all alternatives x, y the value R(x,y) is understood as the 
degree to which the proposition “x is not worse than y” is true, or as the intensity of 
preference of x over y, etc. Usually it is supposed that the set of truth values L 
coincides with the interval [0,1]. In this case the operations on the set of FPR may be 
defined by means of fuzzy logic operations given on L. Generally, L may denote some 
linearly ordered set of preference values [2, 7]. For example, L may be a set of 
numerical values, the set of scores {0, 1, 2, 3, 4, 5, 6}, or the set of linguistic 
evaluations such as “very small preference”, “small preference”, “strong preference”, etc. 

Usually a weak FPR and its associated strict, indifference, and incomparability 
fuzzy relations are considered [5, 6]. The properties of rationality of DM procedures 
are related to the properties of consistency of the underlying FPR. These consistency 
properties are usually formulated in the form of transitivity or acyclicity of the weak 
FPR and the associated strict FPR. The types of consistency of strict FPR, in the form 
of types of transitivity and acyclicity of these relations, are considered in this work. 
The absence of a desired consistency property may be used for the correction of a 
given strict FPR by some formal procedure or for the re-evaluation of the preference 
values for some pairs of alternatives. 

The more traditional approach considers the rationality of a fuzzy choice function 
(FCF) with respect to the crisp set of non-dominated alternatives, which may be 
obtained as a result of the use of the FCF [2, 3, 6, 8, 10]. The existence of such a 
non-dominated set of alternatives is related to the acyclicity of the underlying FPR 
[2, 3]. This approach effectively reduces the problem to non-valued, crisp choice 
functions and crisp acyclic relations and makes little use of the information about 
valued preferences. In our work, we consider the FCF as a ranking function, and the 
rationality conditions of the FCF are formulated as rationality of rankings on all 
possible subsets of alternatives. 

The paper is organized as follows. The consistency properties of strict fuzzy
preference relations in terms of possible types of transitivity and acyclicity are studied
in Section 2. The rationality conditions for FCF are considered in Section 3. The
relationships between the consistency properties of strict FPR and the rationality
conditions of FCF are studied in Section 4. An example of the application of DM procedures
in multiagent systems is discussed in Section 5. Finally, the conclusions and further
directions for extending the proposed models are discussed.



2 Strict Valued Preference Relations 



A valued relation on a universal set of alternatives $\Omega$ is a function $P: \Omega \times \Omega \to L_P$, where
$L_P$ is a linearly ordered set of preference values with minimum and maximum
elements denoted as 0 and $I$ respectively. We will consider here the set of preference
values $L_P = [0,1]$ used in fuzzy logic, with the ordering relation $\leq$ defined by the linear
ordering of real numbers, so that the minimum is 0 and $I = 1$. Generally, many results related to
strict FPR and FCF may be extended to the case of a finite scale $L_P = \{a_0, a_1, \ldots, a_n\}$
with linearly ordered grades $a_0 < a_1 < \ldots < a_n$. Such a scale may contain numerical
grades like $L = \{0, 1, 2, 3, 4, 5, 6\}$ or linguistic grades $L$ = {absence of preference, very
small preference, small preference, average preference, strong preference, very
strong preference, absolute preference}. For this reason we will consider here the
terms valued preference relation and fuzzy preference relation as synonyms [2, 7].
The linear ordering relation $\leq$ on $L_P$ defines the operations $\wedge$ and $\vee$ on $L_P$: $a \wedge b = a$ and
$a \vee b = b$ iff $a \leq b$ (i.e. $a < b$ or $a = b$) for all $a, b$ from $L_P$. The negation operation $'$ may
be introduced on $L_P$ as follows: $a' = 1 - a$ for $L_P = [0,1]$ and $a_i' = a_{n-i}$ for a finite scale
with $n+1$ grades. The operations on $L_P$ satisfy the De Morgan laws $(a \wedge b)' = a' \vee b'$ and
$(a \vee b)' = a' \wedge b'$ and the involution law $a'' = a$.
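The scale operations just defined can be illustrated with a small Python sketch. It is not part of the original paper: it assumes that the grades of a finite scale are encoded by their indices 0..n, and the function names are hypothetical.

```python
# Sketch of the operations on a linearly ordered preference scale.
# Assumption: a finite scale with n + 1 grades is encoded by indices 0..n.

def meet(a, b):
    """a AND b: the smaller of the two grades."""
    return min(a, b)

def join(a, b):
    """a OR b: the larger of the two grades."""
    return max(a, b)

def neg_unit(a):
    """Negation on [0, 1]: a' = 1 - a."""
    return 1 - a

def neg_finite(i, n):
    """Negation on a finite scale with n + 1 grades: grade i maps to grade n - i."""
    return n - i

def check_laws(n):
    """Check the De Morgan and involution laws on the scale {0, ..., n}."""
    grades = range(n + 1)
    for a in grades:
        if neg_finite(neg_finite(a, n), n) != a:
            return False
        for b in grades:
            if neg_finite(meet(a, b), n) != join(neg_finite(a, n), neg_finite(b, n)):
                return False
            if neg_finite(join(a, b), n) != meet(neg_finite(a, n), neg_finite(b, n)):
                return False
    return True

laws_hold_on_seven_grades = check_laws(6)
```

For the seven-grade scale {0, ..., 6} the negation maps grade 2 to grade 4, and the check confirms the De Morgan and involution laws by exhaustive enumeration.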







I. Batyrshin, N. Shajdullina, and L. Sheremetov 



$P$ will be called a strict FPR if it satisfies on $\Omega$ the asymmetry condition $P(x,y) \wedge P(y,x) = 0$.
We will write $P(x,y) \geq 0$ if $P(y,x) = 0$. $P(x,y)$ will be understood as a preference
degree or intensity of preference of x over y.

The following types of transitivity reflect the different types of consistency of $P$:

- Weak transitivity: WT. From $P(x,y) > 0$ and $P(y,z) > 0$ it follows that $P(x,z) > 0$.

- Negative transitivity: NT. From $P(x,y) \geq 0$ and $P(y,z) \geq 0$ it follows that $P(x,z) \geq 0$.

- Transitivity: T. From $P(x,y) \geq 0$ and $P(y,z) \geq 0$ it follows that $P(x,z) \geq P(x,y) \wedge P(y,z)$.

- Strong transitivity: ST. From $P(x,y) \geq 0$ and $P(y,z) \geq 0$ it follows that
$P(x,z) \geq P(x,y) \vee P(y,z)$.

- Quasi-series: QS. From $P(x,y) \geq 0$ and $P(y,z) \geq 0$ it follows that $P(x,z) = P(x,y) \vee P(y,z)$.

- Super-strong transitivity: SST. Strong transitivity together with the property:
from $P(x,y) > 0$ and $P(y,z) > 0$ it follows that $P(x,z) > P(x,y) \vee P(y,z)$.

Suppose $P$ is a strict FPR on $\Omega$ and $x_0, x_1, \ldots, x_n$ ($n \geq 2$) are some elements of $\Omega$.

Consider the following types of cycles, which may be induced by $P$ in $\Omega$:

- 0-cycle: $P(x_0,x_1) > 0$, $P(x_1,x_2) > 0$, ..., $P(x_{n-1},x_n) > 0$, $P(x_n,x_0) > 0$;

- a-cycle: $P(x_0,x_1) \geq a$, $P(x_1,x_2) \geq a$, ..., $P(x_{n-1},x_n) \geq a$, $P(x_n,x_0) \geq a$, where $a \in L$, $a > 0$;

- max-cycle: $P(x_0,x_1) = P(x_1,x_2) = \ldots = P(x_{n-1},x_n) = P(x_n,x_0) = a \geq P(x_i,x_k)$ for some $a \in L$,
$a > 0$, and all $i, k \in \{0, 1, \ldots, n\}$.

As a special case of an a-cycle with $a = I$, the I-cycle will be considered. It is clear that
any I-cycle is a max-cycle. We will say that a strict FPR $P$ satisfies one of the
properties acyclicity (0-AC), a-acyclicity (a-AC), I-acyclicity (I-AC) and max-acyclicity
(MAC) if it does not contain, correspondingly, 0-cycles, a-cycles, I-cycles
and max-cycles.
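The acyclicity properties above can be tested mechanically. The following Python sketch is an illustration only, not the authors' procedure: it assumes that an a-cycle requires every link to have degree at least a and a 0-cycle requires every link to be positive, and it stores the relation as a dictionary keyed by ordered pairs. The function name and the example values are hypothetical. Max-cycles are not covered.

```python
def has_cycle(P, items, a, strict=False):
    """Return True if the relation P contains a cycle at level a.

    P maps ordered pairs (x, y) to preference degrees; missing pairs count as 0.
    With strict=False every link must satisfy P(x, y) >= a (an a-cycle, a > 0).
    With strict=True and a = 0 every link must satisfy P(x, y) > 0 (a 0-cycle).
    P is assumed to be asymmetric, so any cycle found has at least 3 elements.
    """
    def linked(x, y):
        value = P.get((x, y), 0)
        return value > a if strict else value >= a

    successors = {x: [y for y in items if y != x and linked(x, y)] for x in items}
    state = {x: "new" for x in items}  # new -> open -> done

    def visit(x):
        state[x] = "open"
        for y in successors[x]:
            if state[y] == "open":
                return True  # a back edge closes a directed cycle
            if state[y] == "new" and visit(y):
                return True
        state[x] = "done"
        return False

    return any(state[x] == "new" and visit(x) for x in items)


# Illustrative relation (values chosen for this example only).
P = {("x", "y"): 0.2, ("y", "z"): 0.6, ("z", "x"): 0.2}
items = ["x", "y", "z"]
zero_acyclic = not has_cycle(P, items, 0, strict=True)
acyclic_at_06 = not has_cycle(P, items, 0.6)
```

In this example the relation contains a 0-cycle, so it is not 0-AC, but it contains no cycle at level 0.6, which illustrates how the weaker acyclicity classes admit relations that the stronger ones exclude.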

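Proposition 1. The transitive and acyclic classes of strict FPR are partially ordered
by inclusion as follows:

QS $\subset$ ST $\subset$ NT $\subset$ WT, SST $\subset$ ST $\subset$ T $\subset$ WT $\subset$ 0-AC $\subset$ $a_1$-AC $\subset$ $a_2$-AC $\subset$ I-AC,

MAC $\subset$ I-AC, where $a_1$ and $a_2$ are elements of $L$ such that $0 < a_1 < a_2 < I$.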

The fuzzy quasi-series is a direct fuzzification of the crisp quasi-series considered in
[2, 10]. A fuzzy quasi-series is related to a fuzzy quasi-ordering relation [13] and
defines a hierarchical partition of the set of alternatives into ordered classes of
alternatives. Due to their special structure, these relations form the narrowest
class of strict FPR, whereas the class I-AC is the widest class of strict FPR.



3 Fuzzy Choice Functions 

Suppose $L_C$ is a linearly ordered set of evaluations of the respective quality of
alternatives in the sets of alternatives $X \subseteq \Omega$. The minimum 0 and the maximum $I$
elements of $L_C$ will be considered as evaluations of the “worst” and the “best”
alternatives in $X$. Generally we will suppose that such “worst” and “best”
alternatives may be absent in $X$, which may contain, for example, only “good” or “not
bad” alternatives. The set $L_C$ will be considered as the set of possible values of the FCF
related to the fuzzy preference relation $P$ defined on $\Omega$. For this reason $L_C$ will be
tied to the set of values $L_P$ of the corresponding FPR. In the fuzzy context, it will be
supposed that $L_C = [0,1]$, but generally it may be a set of scores $L_C = \{0, 1, 2, 3, 4, 5, 6\}$
or a set of suitable linguistic values.

A fuzzy choice function $C$ is a correspondence which defines for each finite set of
alternatives $X \subseteq \Omega$ the function $C_X: X \to L_C$. In such a definition, the FCF is in effect a score
function measuring alternatives on the scale $L_C$ and defining some linear ordering of the
alternatives from the given set $X$. The possible rationality properties of these orderings
on different sets of alternatives $X \subseteq \Omega$ are discussed in [2]. We consider here several
new rationality conditions for FCF. In the following, any FCF will be required to fulfil
the trivial choice property:

TC. $(\forall x \in \Omega)\; C_{\{x\}}(x) = I$.

This condition says that any element x is “the best” in the set containing only this
element. Another, stronger condition requires that in any set of alternatives “the
best” alternative exists:

BC. $(\forall X \subseteq \Omega)(\exists x \in X)\; C_X(x) = I$.

This axiom is a very strong requirement because the set of “the best” alternatives may
in general be empty. As shown in DM theory, the choice functions generated by
preference relations fulfil a similar condition if the corresponding preference
relation is acyclic [2, 3]. This problem is also discussed below.

The following two conditions weaken the previous one.

b-UAC. $(\forall X \subseteq \Omega)(\exists x \in X)\; C_X(x) > b$,

where $b \in L$ ($b > 0$) is some level of unacceptability of the quality of alternatives
chosen from a given set of alternatives. A “good” alternative should have a
quality greater than this level. The next special case of the previous
condition requires the existence of “not the worst” alternatives:

NWC. $(\forall X \subseteq \Omega)(\exists x \in X)\; C_X(x) > 0$.

This rationality condition requires that in any set of alternatives the rational choice
function can select “not the worst” alternatives. Another possible requirement on FCF
is the existence of a nontrivial ordering of alternatives:

NTO. $(\forall X \subseteq \Omega)\,\big((|X| \geq 2) \to (\exists x, y \in X)\,(x \neq y)\ \&\ (C_X(x) > C_X(y))\big)$.

A stronger condition on FCF requires that a “best” alternative should be a
“standard” element:

ES. $(\forall X \subseteq \Omega)(\forall x \in X)\,\big((C_X(x) = I) \to (\forall y \in X)(C_X(y) = C_{\{x,y\}}(y))\big)$.

This condition is very strong. For choice functions satisfying this condition it
follows, for example, that if we have “the best” element in some set $X$, then to
evaluate the quality of the other alternatives in $X$ it is sufficient to compare these
alternatives only with this standard (or “ideal”) element; the other alternatives in $X$
need not be considered in this evaluation.

The condition of dependence of strict orderings:

DSO. $(\forall X, Y \subseteq \Omega)(\forall x, y \in X \cap Y)\,\big((C_X(x) > C_X(y)) \to (C_Y(x) > C_Y(y))\big)$,

requires that if x has a higher value of the choice function than y in some set $X$, then the same
holds in any set $Y$ containing both alternatives.

Proposition 2. BC $\subset$ $b_1$-UAC $\subset$ $b_2$-UAC $\subset$ NWC, NTO $\subset$ NWC, ES $\cap$ BC $\subset$ DSO,
where $b_1$ and $b_2$ are elements of $L$ such that $0 < b_2 < b_1 < I$.

As shown in the following section, if the FCF is generated by strict FPR then all 
these conditions are characterized by some requirements on transitivity or acyclicity 
of this relation. 



4 Choice Functions Generated by Strict FPR 

Fuzzy choice functions may be generated by some FPR [10, 2] as follows:

$$C_X(x) = \Big(\max_{y \in X} P(y,x)\Big)' = \min_{y \in X} \big(P(y,x)\big)'.$$

It is clear that the properties of a choice function and of the strict preference relation
generating this choice function are interrelated. It is also clear from the asymmetry of
fuzzy strict preference relations $P$ and from the definition of the choice function that any
such choice function satisfies the property TC. In the following we will also need the
condition of weak completeness of a linguistic strict preference relation:

WC. From $x \neq y$ it follows that $P(x,y) \vee P(y,x) > 0$.

Theorem 3. The diagram in Fig. 1 characterizes the FCFs and the fuzzy strict
preference relations generating these choice functions.

In this diagram, $A \leftrightarrow B$ denotes that the choice function satisfies the property A if
and only if the strict preference relation generating this choice function satisfies the
property B, and $A \to B$ denotes that B follows from A. For example, a choice function
satisfies the condition b-UAC iff the strict preference relation $P$ generating this
choice function is a-acyclic with $a = b'$.
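The construction of a choice function from a strict FPR, and a brute-force check of the b-UAC condition, can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' implementation: preference values lie in [0, 1] with negation a' = 1 - a, the relation is a dictionary keyed by ordered pairs, and the function names and example values are hypothetical.

```python
from itertools import combinations

def choice_function(P, X, top=1.0):
    """C_X(x) = (max over y in X of P(y, x))', using the negation a' = top - a.

    Missing pairs, including (x, x), count as degree 0.
    """
    return {x: top - max(P.get((y, x), 0.0) for y in X) for x in X}

def satisfies_b_uac(P, items, b, top=1.0):
    """Check b-UAC by enumerating every non-empty subset X of items."""
    for size in range(1, len(items) + 1):
        for X in combinations(items, size):
            values = choice_function(P, X, top).values()
            if not any(v > b for v in values):
                return False
    return True


# Illustrative relation with a cycle x -> y -> z -> x (example values only).
P = {("x", "y"): 0.2, ("y", "z"): 0.6, ("z", "x"): 0.2}
items = ["x", "y", "z"]
C = choice_function(P, items)      # approximately x: 0.8, y: 0.8, z: 0.4
uac_05 = satisfies_b_uac(P, items, 0.5)
uac_09 = satisfies_b_uac(P, items, 0.9)
```

Here the relation has no cycle whose links all reach 0.5, and the choice function satisfies 0.5-UAC; it does have a cycle whose links all reach 0.1, and 0.9-UAC fails. Both outcomes agree with the correspondence between b-UAC and b'-acyclicity stated in the text.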

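[Fig. 1, diagram. Each row relates a property of the choice function to the equivalent property of the generating relation: BC $\leftrightarrow$ 0-AC; b-UAC $\leftrightarrow$ b'-AC; NWC $\leftrightarrow$ I-AC; NTO $\leftrightarrow$ MAC+WC. Vertical arrows in both columns lead from the first row to the second, from the second to the third, and from the fourth to the third.]

Fig. 1. Relationships between the classes of choice functions and linguistic strict preference
relations generating these choice functions.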
5 Group Decisions

In group DM, the intensity of preferences often plays an important role. Consider an
example. Five friends want to decide where to go in the evening. Three of
them slightly prefer the bar to the restaurant, but the other two strongly prefer the restaurant to the bar. If
the intensity of preferences is not taken into account then, applying for example the simple
majority rule, the bar should be chosen. Usually, however, the intensity of preferences
influences the group decision, and in this case the restaurant may be chosen if the
preference for the restaurant of the two friends is very strong. Different
methods of aggregation of preference intensities have been proposed [6, 7]. Several
methods of aggregation of FPR, which generalize the classical crisp methods, are
considered further in this section. As shown by Arrow, a group decision procedure
satisfying several axioms of rationality does not exist. Any such proposed procedure
may be criticized from one point of view or another. The generalizations of some of
these procedures to the case of valued preferences are not free from criticism either,
but they make it possible to take into account the intensity of preferences and, as a
result, to diminish the drawbacks of crisp procedures.



5.1 Valued Simple Majority Rule 

A rough formulation of the simple majority rule is the following: an alternative is the
winner if it is placed in the first position by a majority of agents. It may happen that
several alternatives receive an equal number of votes. In this case, some additional
procedure for resolving such situations may be used [4]. A possible generalization of
this method to fuzzy preference relations for linguistic evaluations of intensity was
considered in [7]. We propose here a new method, which uses the fuzzy evaluations
of intensity in strict FPR.

The crisp simple majority rule takes into account only information about the best
alternatives and may be considered as a procedure operating with the individual
choice functions. The alternative x receives the vote $V_i(x) = 1$ if it belongs to the choice
function of the i-th agent. The sum of these votes defines the winner. The vote that
an alternative receives from some agent may be considered as the value of the characteristic
function of the choice function defined by the linear ordering of all alternatives by this agent.
Table 1 shows an example of a preference profile for 5 agents $a_i$ on the set of three
alternatives $X = \{x, y, z\}$, where, for example, the first column denotes the ordering $x > y
> z$ of alternatives corresponding to the agent $a_1$. Here x and y have the maximum
score, equal to 2, which is the number of times each is placed first in the
preference profile. Table 2 contains the corresponding values of votes of alternatives
in this preference profile.

Table 1. Profile of 5 preferences

Table 2. Choice functions and sum of votes for profile from Table 1.
Consider a possible fuzzy generalization of the simple majority rule based on a profile
of strict FPR. Each strict FPR is replaced by the corresponding FCF, which linearly orders
the alternatives with respect to the value of the choice function. For each alternative, the
sum of its membership values in the FCFs of all agents is
calculated. The alternative with the maximum value is considered as the solution of
the group decision problem.

For the group decision problem of the 5 friends considered above, the crisp
and fuzzy simple majority rules give the following results. Suppose the fuzzy strict
preferences between bar and restaurant for these 5 friends have the following values:
$P_1$(bar, restaurant) = 0.2, $P_2$(bar, restaurant) = 0.3, $P_3$(bar, restaurant) = 0.2,
$P_4$(restaurant, bar) = 0.8, $P_5$(restaurant, bar) = 0.7. The corresponding matrices of
fuzzy strict preference relations are presented in Table 3. Table 4 contains the
corresponding crisp ranking when the intensity of preferences is not taken into
account. Table 5 contains the scores obtained by all alternatives under the crisp simple
majority rule. Table 6 contains the FCFs defined by the FPRs and the resulting scores of
alternatives. As can be seen, the crisp and fuzzy simple majority rules give different
results because the fuzzy approach makes it possible to take into account the intensity
of preferences, which is not considered by the crisp approach.
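The comparison of the two rules for the five friends can be reproduced with a short Python sketch. It uses the preference values stated in the text; the data layout (one dictionary of positive preference degrees per friend) and the function names are assumptions made for this illustration.

```python
ALTERNATIVES = ["bar", "restaurant"]

# Positive entries of each friend's strict FPR, as given in the text.
PROFILE = [
    {("bar", "restaurant"): 0.2},
    {("bar", "restaurant"): 0.3},
    {("bar", "restaurant"): 0.2},
    {("restaurant", "bar"): 0.8},
    {("restaurant", "bar"): 0.7},
]

def choice_function(P, X):
    """C_X(x) = 1 - max over y in X of P(y, x); missing pairs count as 0."""
    return {x: 1.0 - max(P.get((y, x), 0.0) for y in X) for x in X}

def crisp_scores(profile, X):
    """One vote per agent for every alternative that the agent does not rank below another."""
    scores = {x: 0 for x in X}
    for P in profile:
        for x in X:
            if all(P.get((y, x), 0.0) == 0.0 for y in X):
                scores[x] += 1
    return scores

def fuzzy_scores(profile, X):
    """Sum of the fuzzy choice function values over all agents."""
    scores = {x: 0.0 for x in X}
    for P in profile:
        for x, value in choice_function(P, X).items():
            scores[x] += value
    return scores

def winner(scores):
    """Alternative with the maximum score."""
    return max(scores, key=scores.get)

crisp = crisp_scores(PROFILE, ALTERNATIVES)
fuzzy = fuzzy_scores(PROFILE, ALTERNATIVES)
```

The crisp scores are 3 for the bar and 2 for the restaurant, while the fuzzy scores are about 3.5 and 4.3, so `winner(crisp)` is the bar and `winner(fuzzy)` is the restaurant.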



Table 3. Example of 5 fuzzy strict preference relations

Table 4. Profile of crisp preferences corresponding to Table 3

  a_1         a_2         a_3         a_4         a_5
  bar         bar         bar         restaurant  restaurant
  restaurant  restaurant  restaurant  bar         bar

Table 5. Crisp choice functions and sum of votes for profile from Table 4

              a_1   a_2   a_3   a_4   a_5   SUM(V)
  bar          1     1     1     0     0     3
  restaurant   0     0     0     1     1     2

Table 6. Fuzzy choice functions and sum of votes for profile from Table 3

              a_1   a_2   a_3   a_4   a_5   SUM(V)
  bar          1     1     1     0.2   0.3   3.5
  restaurant   0.8   0.7   0.8   1     1     4.3
Table 7. Example on fuzzy Condorcet winner rule

5.2 Fuzzy Condorcet Winner

The crisp Condorcet winner rule directly uses the information about pair-wise
preferences. For a given preference profile, an alternative x is a Condorcet winner if in
more than half of the preference relations from the profile it is preferred to every
other alternative. Unfortunately the Condorcet winner does not always exist. This
situation happens for the example presented in Table 1. An alternative that is better
than all other alternatives for more than 2 agents does not exist: the alternative x is
better than y for 3 agents, y is better than z for 4 agents and z is better than x for 3
agents. We obtain the cycle $x > y$, $y > z$, $z > x$, which does not give us the possibility
to select or reject one of the three alternatives. But if we consider valued preference
relations then one alternative may be rejected.

Let us define a strict valued preference relation on the set of alternatives in the
following way. Denote by $V(x,y)$ the number of agents which say that x is better than y.
Define $P(x,y) = \max\{0, V(x,y) - V(y,x)\}/N$, where $N$ is the total number of agents. Then for
our example, we obtain the strict valued preference relation shown in Table 7. The
obtained strict FPR satisfies the max-acyclicity (MAC) and weak completeness (WC)
conditions, and according to Theorem 3 the choice function of this strict FPR
satisfies the non-trivial ordering (NTO) condition and contains alternatives with
different values of the FCF. The FCF of this FPR is shown in the last row of Table 7. In
comparison with x and y, the alternative z obtains a lower value of the FCF and may be
rejected. We should note that for the considered fuzzy Condorcet winner rule it is also
possible to obtain a strict FPR which does not satisfy the MAC condition, such that all
alternatives form a cycle of preferences with equal values. But such a situation is
much less likely than in the case of the crisp Condorcet winner rule. It may be
shown that for some classes of preference profiles the fuzzy Condorcet winner rule
will always give an FPR satisfying the MAC condition, and hence the corresponding FCF
will satisfy the non-trivial ordering condition.
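The fuzzy Condorcet construction can be checked numerically with the pair-wise counts quoted in the text (x over y for 3 of 5 agents, y over z for 4, z over x for 3). The Python sketch below is an illustration only; the dictionary layout and function names are assumptions, and the choice function uses the negation a' = 1 - a.

```python
def condorcet_relation(V, items, n_agents):
    """P(x, y) = max{0, V(x, y) - V(y, x)} / N for every ordered pair x != y."""
    P = {}
    for x in items:
        for y in items:
            if x != y:
                P[(x, y)] = max(0, V.get((x, y), 0) - V.get((y, x), 0)) / n_agents
    return P

def choice_function(P, X):
    """C_X(x) = 1 - max over y in X of P(y, x); missing pairs count as 0."""
    return {x: 1.0 - max(P.get((y, x), 0.0) for y in X) for x in X}

# Pair-wise counts stated in the text for the profile of 5 agents.
ITEMS = ["x", "y", "z"]
V = {
    ("x", "y"): 3, ("y", "x"): 2,
    ("y", "z"): 4, ("z", "y"): 1,
    ("z", "x"): 3, ("x", "z"): 2,
}

P = condorcet_relation(V, ITEMS, 5)
C = choice_function(P, ITEMS)
rejected = min(C, key=C.get)
```

With these counts the positive degrees are P(x,y) = 0.2, P(y,z) = 0.6 and P(z,x) = 0.2. The links of the cycle are not all equal, so it is not a max-cycle, and the choice function gives about 0.8 to x and y and about 0.4 to z, which is therefore the alternative to reject.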



6 Conclusions 

In this paper, we have presented models of valued strict preferences, which can be
used by agents addressing the problems of ranking and aggregation of their fuzzy
opinions. We considered the consistency properties of strict FPR separately from the
properties of any weak FPR. First, this makes it possible to analyze finer
structures of FPR related to the rationality properties of DM procedures. Second,
in many situations the initially given information about preferences may be
presented directly in the form of a strict FPR, i.e. as an asymmetric FPR. Such an FPR may
be obtained, for example, as a result of pair-wise comparison of all alternatives by
answering two questions: 1) Which alternative of the considered pair is
preferable? 2) If one of the alternatives is preferable, what is the intensity of
this preference? We think that it is easier for an expert to evaluate pair-wise
preferences in the form given by a strict FPR than to evaluate two intensities of
preference, of alternative a over b and of alternative b over a, as needed for a weak FPR.

Another distinctive feature of the considered approach is the way in which we study
the rationality of the FCF. The more traditional approach to FCF considers only the crisp set
of non-dominated alternatives, which leads to acyclicity of some underlying crisp
preference relation and does not deal much with the intensity of preferences. Our
approach takes into account the information about the intensity of pair-wise
preferences in the underlying FPR and makes it possible to consider as “good”
alternatives those alternatives that are dominated with a “low” intensity. This set of
alternatives may be considered as a solution of a DM problem when the set of the
“best”, non-dominated alternatives is empty. Our approach can thus give a solution to the
DM problem when the more traditional approach does not work. The existence of the set
of “good” alternatives is related to the properties of “weak” acyclicity of the
underlying strict FPR, which essentially use the intensity of pair-wise preferences.
Such “weak” acyclicity properties admit some types of fuzzy cycles in the strict FPR,
so that the choice of “good” alternatives may be made in the presence of such
cycles. Cycles in crisp preference relations usually arise in multi-criteria
evaluation of alternatives or, as in the Condorcet paradox, when these relations are
obtained as a result of aggregation of individual preference relations. In this case, the
use of DM procedures based on strict FPR will decrease the possibility of cycles
arising in the aggregated preference relation for which a rational decision cannot be made.

Central to this model is the incomparability relation that occurs when agents have
conflicting information that prevents them from reaching a consensus. Here we have shown
how valued strict preferences can decrease the number of cycles and in this way the
number of conflicts in multi-agent DM. This model was shown to be applicable to
both single-agent and multi-agent multi-criteria DM problem settings. Several
generalizations of crisp group DM procedures have been proposed that make it
possible to avoid some problems typical for crisp models.
Abstract. In recent years, agent-based computational models have been
used to study financial markets. One of the most interesting elements
involved in these studies is the process of learning, in which market
participants try to obtain information from the market in order to improve
their strategies and hence increase their profits. While other papers
have shown how this learning process is determined by factors
such as the adaptation period, the composition of the market and the
intensity of the signals that an agent can perceive, in this paper we
discuss the effect of external information on the learning process in an
artificial financial market (AFM). In particular, we analyze the case
when external information is such that it forces all participants to
randomly revise their expectations of the future. Even though AFMs usually
use sophisticated artificial intelligence techniques, in this study we show
how interesting results can be obtained using a quite elementary genetic
algorithm.



1 Introduction 

In recent years it has become ever more popular to consider financial markets
(FMs) from an evolutionary, rather than the traditional rational expectations,
point of view [1,2]. In particular, there has been a substantial increase in studies
that use agent-based, evolutionary computer simulations, known as Artificial
Financial Markets (AFM) [3]. In this paper we use a particular AFM, the NNCP
[4], whose design was motivated by the desire to study elements relatively
neglected in other AFMs (for instance, the Santa Fe Virtual Market [5]), such as
the effect of organizational structure on market dynamics and the role of market
makers and information. All of these elements are crucial in the formation of
market microstructure [6].

Among the most interesting aspects one can study in an AFM, and the
central topic of this work, is the process of learning, a set of
mechanisms which allows agents to modify their buy/sell strategy with the aim
of adapting to the conditions imposed by the market. In particular, in this paper
we study the effect of external information on the learning process. For this
purpose we have analyzed the extreme case where the arrival of information forces
all participants to change their perception about the state of the market and
therefore their expectations about its future evolution.

Though the NNCP allows us to include many features that influence the
behavior of the market, in this study we used only a small number of elements:
informed and uninformed agents, adaptive agents, and information “shocks” in
the market’s development. Despite this relatively modest diversity of behaviors
and simplicity of elements, it was possible to conduct experiments that produced
significant results.

The structure of the paper is as follows. In Section 2 we describe the
general form of the elements used in the NNCP, namely the market organization,
the market participants, the information processes (“shocks”), and the learning
mechanisms. In Section 3 we explain the experiments conducted, along with a
discussion of their main results. Finally, we give our conclusions, as well as some
general ideas for future lines of research.

2 The Building Blocks of the NNCP 

The workings of the NNCP are, in general terms, as follows: a simulation is 
carried out for a prescribed number of ticks on a single risky asset. An agent can 
divide his/her wealth between this risky asset and a riskless asset (“cash”). At 
each tick an agent — or a set of agents — takes a position (buy/sell/neutral). 
Shares are bought in fixed size lots of one share. Resources are finite and hence 
traders have portfolio limits associated with either zero cash or zero stock. Short 
selling is not permitted. 

2.1 Market Organization 

The market clearing mechanism used for all simulations in this particular
study is a simple double auction, where at every tick each trader takes a
position with an associated volume and at a given price, each trader being able
to value the asset independently but with prices that are not too different. In
this model price changes are induced only via the disequilibrium between supply
and demand. Specifically:

1. At time (or “tick”) t one lists all the positions taken by the agents and the
associated volume and price. The agents’ bids and offers are obtained via a
Gaussian distribution with mean $p(t-1)$.

2. A bid and an offer are matched only if they overlap, i.e. $p_b(t) > p_o(t)$. To
realize a transaction we used “best bid/offer”, where the highest bid and
the lowest offer are matched at their midpoint successively until there are
no overlapping bids and offers.

After each tick the price is updated exogenously via a supply/demand type law
as in Eq. (1).

$$p(t+1) = p(t)\,\big[1 + \eta\,(B(t) - O(t))\big] \qquad (1)$$
In this equation, which is common to many AFMs, $p(t)$ is the price at tick t and $B(t)$
and $O(t)$ are the demand and supply at t, while $\eta$ is a tuning parameter. Note
that $D(t) = B(t) - O(t)$ depends not only on the positions taken by the agents
but also on the mechanism used to match their trades, e.g. at what price two
contrary trades will be matched. In this sense one may think of a “bare” $D(t)$,
$D_B(t)$, that represents the imbalance in supply and demand associated purely
with the desired trades of the agents, while $D(t)$ represents the residual
imbalance after matching those orders that can be matched under a given clearing
mechanism.
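The matching rule and the price update of Eq. (1) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the NNCP implementation: orders are represented only by their prices (lots of one share), and the function names and numerical values are hypothetical.

```python
def match_orders(bids, offers):
    """Best bid/offer matching: pair the highest bid with the lowest offer
    at their midpoint while the two overlap (bid price > offer price).

    Returns the list of trade prices and the unmatched bids and offers.
    """
    bids = sorted(bids, reverse=True)   # highest bid first
    offers = sorted(offers)             # lowest offer first
    trades = []
    while bids and offers and bids[0] > offers[0]:
        trades.append((bids[0] + offers[0]) / 2)
        bids.pop(0)
        offers.pop(0)
    return trades, bids, offers

def update_price(price, demand, supply, eta):
    """Eq. (1): p(t + 1) = p(t) * [1 + eta * (B(t) - O(t))]."""
    return price * (1 + eta * (demand - supply))


# Example values chosen for illustration only.
trades, open_bids, open_offers = match_orders([10.2, 9.9, 10.5], [10.0, 10.4, 9.8])
next_price = update_price(10.0, demand=6, supply=4, eta=0.01)
```

In this example two trades are executed, at about 10.15 and 10.10, and one bid (9.9) and one offer (10.4) remain unmatched because they do not overlap. With demand 6, supply 4 and a tuning parameter of 0.01, the price moves from 10.0 to about 10.2.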



2.2 Market Participants 



We will divide traders into various classes. Two of the principal classes are in- 
formed and uninformed, or liquidity, traders. The latter make random decisions, 
buying or selling with equal probability irrespective of the market price. In- 
formed agents on the other hand have a higher probability to buy than sell. One 
can try to rationalize this behavior in different ways, each rationalization being 
equally legitimate in the absence of further information. One can, for instance, 
imagine that informed agents have a better understanding of the market dynamics
in that they “know” that in the presence of uninformed traders the excess
demand the informed trader’s bias generates will translate itself, via Eq. (1), into
a price increase which will augment their portfolio values at the expense of the
uninformed. Alternatively, one may simply imagine that the informed traders 
believe the market will rise. We will, in fact, consider a one-parameter family of 
informed traders described by a “bias”, d, where the position probabilities are: 



$$P(c) = \frac{2d}{3}, \qquad P(n) = \frac{1}{3}, \qquad P(v) = \frac{2(1-d)}{3} \qquad (2)$$



where c represents Buy, v Sell and n Hold. For example, when d = 1/2 the
corresponding probabilities are 1/3, 1/3, 1/3, which actually corresponds to
an uninformed trader, i.e. a trader having no statistical bias in favour of one
position versus another. In contrast, a trader with d = 1 has probabilities 2/3,
1/3, 0 and corresponds to a trader with a strong belief that the market will rise
or, alternatively, to a trader who believes that there are many uninformed traders
in the market that can be exploited by selling while the informed trader drives
the price up. We will denote a trading strategy from this one-parameter family by
the pair (100d, 100(1 − d)). Thus, an uninformed, or liquidity, trader is denoted
by (50,50) and a maximally biased one by (100,0). Essentially, the different
traders have different belief systems about the market. As mentioned, due to the
simplicity of the model we may not ask how it is that the different traders arrive
at these different expectations. It may be that they have different information
sets, or it may be due to the fact that they process the same information set in
different ways, or, more realistically, a combination of these.
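The one-parameter family of traders can be illustrated with a short Python sketch. The probabilities follow Eq. (2) as it is constrained by the two worked cases in the text (d = 1/2 and d = 1); the function names and the use of a seeded random generator are assumptions made for this example.

```python
import random

def position_probabilities(d):
    """Buy, hold and sell probabilities for a trader with bias d (0 <= d <= 1)."""
    return {"buy": 2 * d / 3, "hold": 1 / 3, "sell": 2 * (1 - d) / 3}

def strategy_label(d):
    """Label of the strategy as the pair (100d, 100(1 - d))."""
    return (round(100 * d), round(100 * (1 - d)))

def sample_position(d, rng):
    """Draw one position according to the probabilities of Eq. (2)."""
    probs = position_probabilities(d)
    return rng.choices(list(probs), weights=list(probs.values()))[0]


uninformed = position_probabilities(0.5)   # 1/3, 1/3, 1/3
max_biased = position_probabilities(1.0)   # 2/3, 1/3, 0
rng = random.Random(0)
draws = [sample_position(1.0, rng) for _ in range(1000)]
```

A maximally biased trader, labelled (100, 0), never sells, so the list `draws` contains only buy and hold positions, while the uninformed trader, labelled (50, 50), takes each of the three positions with equal probability.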

Given that we wish to compare the relative profitability of the different trading
strategies we need to define profits. Here, the profits of agents are given in
terms of a “moving target” where the excess profit during timestep t is related to the
increase in the market value of an active trading portfolio in the timestep t relative
to the increase in the market value of a buy and hold portfolio in the same
timestep. In this way an excess profit for a given trader over the timestep t can
only arise when there has been a net change in the trader’s portfolio holdings in
the asset and a net change in the asset’s price. This choice of benchmark always
refers the market dynamics to a “zero sum” game, while with other benchmarks
this is not the case. More concretely, we define the “excess” profit of a trader i
in the time interval t − 1 to t to be

$$e_i(t, t-1) = \delta n_i(t)\,\delta p(t), \qquad (3)$$

where $\delta n_i(t)$ is the change in portfolio holdings over the timestep $\delta t = t - (t-1)$
of the trader i and $\delta p(t)$ is the change in asset price over this timestep. The
excess profit earned between times t' and t is

$$E_i(t,t') = \sum_{n=t'}^{t} e_i(n, n-1) \qquad (4)$$
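The excess profit of Eqs. (3) and (4) can be computed directly from the time series of holdings and prices. The Python sketch below is an illustration only; it assumes that both series are lists indexed by tick, and the function names and numbers are hypothetical.

```python
def excess_profits(holdings, prices):
    """Per-tick excess profit e_i(t, t-1) = dn_i(t) * dp(t) for t = 1, 2, ...

    holdings[t] and prices[t] are the trader's share holdings and the asset
    price at tick t; both lists must have the same length.
    """
    return [
        (holdings[t] - holdings[t - 1]) * (prices[t] - prices[t - 1])
        for t in range(1, len(prices))
    ]

def accumulated_excess_profit(holdings, prices):
    """Accumulated excess profit over the whole series, as in Eq. (4)."""
    return sum(excess_profits(holdings, prices))


# Example series chosen for illustration only.
prices = [10, 11, 12, 11]
active = [0, 1, 1, 0]        # buys one share, holds, then sells it
buy_and_hold = [1, 1, 1, 1]  # never changes its holdings

per_tick = excess_profits(active, prices)
total_active = accumulated_excess_profit(active, prices)
total_hold = accumulated_excess_profit(buy_and_hold, prices)
```

The active trader earns an excess profit only in the ticks where both the holdings and the price change, giving 1, 0 and 1 per tick and 2 in total, whereas the buy and hold portfolio has zero excess profit by construction.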



2.3 Adaptation and Learning in the Presence of Endogenous and
Exogenous Information

In the context of only informed and uninformed agents whose strategies remain static
there can be no adaptation in the market, nor any learning. In order to caricature
these elements we introduce adaptive agent strategies wherein an adapting agent
may copy the strategy of the most successful agent currently in the market (the
copycat strategy). In this sense the copycat agents have both to learn or infer
what is the best strategy to copy, and then to adapt their own strategy in the
light of this new knowledge. They do this via standard
“roulette wheel selection”, as commonly used to represent the selection operator
in Genetic Algorithms [7], updating their strategies using accumulated excess
profits as the “fitness” function. In other words, a copycat copies the strategy of
agent i with probability

$$P_i = \frac{E_i(t,t')}{\sum_{l} E_l(t,t')} \qquad (5)$$

where $E_i(t,t')$ is defined in Eq. (4). The copycats observe the market, updating their
information at a fixed frequency, for example every 100 ticks, and copy the
strategy of the agent that wins the roulette wheel selection process. Given that the
roulette wheel selection is stochastic, it may be that a copycat does not copy the
strategy with the most excess profit. It is important to clarify that we are not
interested in pinpointing the agent which is copied so much as in
identifying the strategy that is picked during the process. This way we can
conceive the selection mechanism as a roulette divided not into small regions
representing individual agents, but rather into bigger slices that account for each of
the strategies present in the market; Eq. (6) depicts this situation

$$P_j = \frac{S_j(t,t')}{\sum_{i=1}^{m} S_i(t,t')} \qquad (6)$$

where $S_j(t,t')$ is the sum of accumulated profits (Eq. (4)) of all the agents with
strategy j, and where m is the total number of strategies present in the market.

The more successful a strategy is relative to others, the more likely it is that 
this is the strategy copied. The stochastic nature of the copying process reflects 
the inefficiencies inherent in the learning process. 
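
The strategy-level roulette of Eq. (6) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the representation of agents as (strategy, accumulated profit) pairs and the function names are hypothetical, and accumulated profits are assumed to be non-negative so that they can serve directly as roulette weights.

```python
import random
from collections import defaultdict

def strategy_probabilities(agents):
    """Return {strategy: probability}, following the form of Eq. (6).

    `agents` is a list of (strategy, accumulated_profit) pairs; this layout
    is an assumption of the sketch, not something fixed by the paper.
    """
    totals = defaultdict(float)
    for strategy, profit in agents:
        totals[strategy] += profit          # S_j: summed profit per strategy
    grand_total = sum(totals.values())
    if grand_total <= 0:
        raise ValueError("the roulette needs a positive total profit")
    return {s: p / grand_total for s, p in totals.items()}

def copy_strategy(agents, rng=random):
    """Spin the wheel once and return the strategy a copycat adopts."""
    probs = strategy_probabilities(agents)
    strategies = list(probs)
    weights = [probs[s] for s in strategies]
    return rng.choices(strategies, weights=weights, k=1)[0]

# Hypothetical market: two (99,1) agents and one (50,50) agent.
market = [((99, 1), 30.0), ((99, 1), 10.0), ((50, 50), 10.0)]
print(strategy_probabilities(market))
print(copy_strategy(market, random.Random(1)))
```

In this toy market the (99,1) slice covers 40/50 of the wheel, so it is copied most of the time but not always, which is the stochastic inefficiency described above.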

In the above we described how copycats learn and adapt in the presence
of endogenous information, i.e. information intrinsic to the market itself, this
information being the relative profits of the different trading strategies in the
market. This information is dynamic and so the copycats change their
expectations and beliefs about what will happen in the future. In real markets, however,
new exogenous information frequently arrives. In this context one must ask how
the different traders will react to this information. Such exogenous information
is usually taken to be random. We will follow this paradigm here, imagining the
exogenous information to be in the form of information “shocks”. In this case
we assume that the participants are forced to change their perception of the
state of the market, and therefore their buy/sell strategy. At this point, two
interesting questions can be raised: First, how is the learning process affected by
these information shocks? And second, is the information prior to a shock useful
in the learning process?

3 Experimental Results 

We will answer the above questions in the context of various simulations carried
out using the NNCP artificial market. However, before presenting the
experiments with their results, we will discuss some aspects of the copycats’ adaptation
process. As mentioned previously, in order to adapt their strategies
to the market’s conditions, copycats must “play” a roulette formed from the
profits of each strategy in the market (Eq. 6). It is convenient to recognize the
stochastic effects of this game, in particular those produced by the composition
of the market. We can illustrate this by thinking of a market where the copycats
copy the most popular strategy via roulette wheel selection. Suppose the roulette
is formed by the number S_j(t) of agents that possess each strategy j, so that
strategy i is selected with probability

    P_i(t) = \frac{S_i(t)}{\sum_{j=1}^{m} S_j(t)}    (7)

One question we can answer is how many copycats will adopt strategy i at time
t. Let C be the number of copycat agents, I the initial number of agents with
strategy i and T the total number of agents (i.e. \sum_j S_j(t)). When t = 1 (the
first adaptation) it follows that

    X_i(1) = C P_i(1) = I C / T,    (8)

where X_i(t) is the average number of copycats that adopt strategy i at time t.
Now, when t = 2, the number of agents with strategy i is I + X_i(1), and in
general, after K adaptations, one has

    X_i(K) = C P_i(K-1) = I C/T + I C^2/T^2 + ... + I C^K/T^K = I \sum_{j=1}^{K} (C/T)^j.    (9)

Obviously, C < T, and therefore

    \lim_{t \to \infty} X_i(t) = I C / (T - C),    (10)

which is the expected maximum number of copycats that will copy the most
popular strategy. In Figure 1 we show the result of the first 20 adaptations in
markets with different values of I, C and T, where copycats adapt every tick
and the most popular strategy is associated with informed (51,49) agents. The
graphs are the result of averaging over 10 different runs. In Figure 1 we see how
the number of correct copycats asymptotes to a value close to that given by Eq.
(10). There is a slight difference in that the graphs are for copycats that copy
the most profitable strategy. However, for weak bias we see that Eq. (10) gives
a good approximation. More generally, it gives a lower bound for the number of
correct copycats.

Fig. 1. Number of copycats that learn strategy i in simulations with different values
of {I, C, T}. The values are {300, 150, 600} (left) and {200, 200, 600} (right).
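
The geometric series of Eq. (9) and its limit in Eq. (10) are easy to check numerically. The following sketch uses hypothetical function names and takes I, C and T as plain numbers; it evaluates the partial sum after K adaptations and the asymptotic value for the two parameter sets quoted for Figure 1.

```python
def expected_copycats(I, C, T, K):
    """Partial sum X_i(K) = I * sum_{j=1..K} (C/T)**j, as in Eq. (9)."""
    if not 0 <= C < T:
        raise ValueError("the series needs 0 <= C < T")
    ratio = C / T
    return I * sum(ratio ** j for j in range(1, K + 1))

def limit_copycats(I, C, T):
    """Asymptotic value I*C/(T - C), as in Eq. (10)."""
    return I * C / (T - C)

# Parameter sets {I, C, T} quoted for Figure 1.
for I, C, T in [(300, 150, 600), (200, 200, 600)]:
    print((I, C, T), expected_copycats(I, C, T, 20), limit_copycats(I, C, T))
```

For both parameter sets the limit is 100 copycats, and because C/T is at most 1/3 here the partial sum is already very close to that value after 20 adaptations.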

Returning to the problem of learning: the objective of a copycat is to acquire
the optimal strategy (i.e. the strategy that maximizes profits under
existing market conditions); conversely, the objective of the biased traders is
to create an excess demand. This excess demand drives the price via the
price evolution equation (Eq. 1), along with the profits of informed agents, as
has been noted in previous work [4,8]. Additionally, both the excess demand
and the profits of the informed traders depend on the composition of the entire
population as well as on the distribution of biases. In this scenario, copycats
try to copy informed traders to find the optimal strategy. This activates the
learning process. However, complete learning is by no means guaranteed, in the
sense that the copycats do not necessarily identify the best strategy. The quality of
the learning depends on the signal-to-noise ratio (i.e. the size of the different regions
in the roulette), which in turn depends on the agent biases and the market
composition. Note that the learning might be incomplete even in the case where
there is only one other strategy to learn. As an illustration of the latter consider
Figure 2, where we show the number of copycats that learn the correct strategy
in three different experiments. In the first case (Experiment A), the market
is composed of 20 agents of each of the following strategies: (50,50), (60,40),
(70,30), (80,20), (90,10), together with 100 (99,1) agents and 100 copycats. In Experiment
B the market is formed by 100 uninformed agents (i.e. with a (50,50) strategy),
100 (60,40) agents and 100 copycats. Finally, Experiment C was composed of
100 (50,50) agents, 100 (99,1) agents and 100 copycats. The roulette at time t
was built using E_i(t, 0), that is, the profits calculated since the beginning of the
experiment.

Fig. 2. Incompleteness of learning: number of copycats that learn strategy i in
experiments with different market compositions

In Experiment A, the optimal strategy is (99,1). However, the presence of
other strategies with lesser yields confuses the copycats in such a way that only
about 70% of them learn successfully, i.e. identify the optimal
strategy (the average due purely to the composition of the market is 50% in
all cases; this can be derived through simple probabilistic arguments with the
use of the roulette). In Experiment B we can observe that the interaction of
the (50,50) and (60,40) strategies generates only a relatively small signal, which
explains why the number of copycats that learn the best strategy is only
slightly bigger than the average given by the market’s composition. Experiment C
shows the imperfection of the learning process even in a market with a very large
difference in biases: owing to the stochastic nature of the roulette wheel,
complete learning cannot take place. We see in general, then, that the efficiency
of learning depends on the market biases and the diversity of strategies in the
market as well as on the stochastic nature of the roulette wheel selection.
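
The 50% baseline quoted for the three experiments can be reproduced with a short calculation, assuming that "the average due purely to the composition of the market" refers to a roulette built from agent counts alone, in the spirit of Eq. (7). That reading, and the helper name below, are assumptions of this sketch.

```python
def baseline_share(composition, optimal):
    """Fraction of fixed-strategy agents that hold the `optimal` strategy.

    `composition` maps strategy -> number of agents, copycats excluded.
    """
    total = sum(composition.values())
    return composition[optimal] / total

experiment_a = {(50, 50): 20, (60, 40): 20, (70, 30): 20,
                (80, 20): 20, (90, 10): 20, (99, 1): 100}
experiment_b = {(50, 50): 100, (60, 40): 100}
experiment_c = {(50, 50): 100, (99, 1): 100}

print(baseline_share(experiment_a, (99, 1)))   # 100 of 200 agents
print(baseline_share(experiment_b, (60, 40)))  # 100 of 200 agents
print(baseline_share(experiment_c, (99, 1)))   # 100 of 200 agents
```

Under this reading, any learning above one half reflects the profit signal rather than the mere popularity of the best strategy.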



3.1 Simulating Markets with Exogenous Shocks

As mentioned earlier, we model the effect of an information shock by changing
the perception of all the market participants. This is done by means of a
random re-selection of strategies among the agents at the moment of the shock.
Specifically, by labeling the agents as either informed (I) or uninformed (U),
we can visualize an information shock as a moment t_s at which all the market’s
participants select a strategy again, either U or I. Thus, an agent that prior
to a shock is informed may become uninformed with a certain probability s or
may remain in its original state with probability 1 - s (see footnote 1). With this
scheme, over a period that contains several shocks an agent can no longer be tagged as
being either informed or uninformed. However, we can label it with a chain
that represents the different states it has occupied over a certain interval
of time (see footnote 2). For instance, an agent may be defined by the sequence IUUI,
which means that, starting out as informed, it became uninformed at the
first shock, remained in this state after the following shock and returned to an
informed strategy after the third shock. This simulation of information processes
clearly creates a new level of complexity in the system since now we face many
more options than the ones present in a static market.
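
The shock mechanism and the resulting state chains (such as IUUI) can be illustrated with a small simulation. This is a sketch under an explicit assumption: every agent, whether informed or uninformed, switches state with the same probability s at each shock. The function names and the string encoding of states are hypothetical.

```python
import random

def apply_shock(states, s, rng):
    """Return post-shock states; each agent switches with probability s."""
    flip = {"I": "U", "U": "I"}
    return [flip[x] if rng.random() < s else x for x in states]

def simulate_histories(initial, n_shocks, s, seed=0):
    """Return one state chain per agent, e.g. 'IUUI' after three shocks."""
    rng = random.Random(seed)
    current = list(initial)
    histories = list(initial)           # each chain starts with the initial state
    for _ in range(n_shocks):
        current = apply_shock(current, s, rng)
        histories = [h + c for h, c in zip(histories, current)]
    return histories

print(simulate_histories(["I", "U", "I"], n_shocks=3, s=0.5))
```

With s = 0 every chain stays constant, and with s = 1 every agent alternates at each shock; intermediate values produce the mixed chains described in the text.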

Taking this into account, after a shock an evolutionary agent must re-learn
what the optimal strategy is under the new market conditions. An
interesting problem is determining how much endogenous information an evolutionary
agent needs to learn the best strategy, given the dynamic conditions of the
market. In the examples presented in the first part of this section, each evolutionary
agent used the entire history of the traders’ profits to make a decision, i.e. each
agent had “long-term” memory. However, with the introduction of shocks into
the system, the evolutionary agents now face new difficulties in processing the
available information. In this sense, the amount of data used for inference
becomes a crucial matter: a time window spanning several shocks does not carry
the same information when strategies can shift between informed and uninformed
as when they remain static. And so, we may pose the following question: Will the same
information be as useful in a system with changing perceptions? In Figure 3 we show
the results of two experiments that shed some light on this question. In
Experiment D the copycats try to copy a strategy using only the information generated
by the market after each shock. Thus, they have only “short-term” memory, as
they do not keep any information prior to the shock. In
contrast, Experiment E depicts the case in which copycats have long-term memory,
preserving the entire information of the market’s history without distinguishing
data obtained before and after shocks. In these experiments the traders change
their perception of the market to one of two possible states: either an informed
or an uninformed strategy. In other words, despite the shocks, during all the
periods of the experiment the optimal strategy is invariant, i.e. to be constantly
informed; what changes are the perceptions of the participants, which affect the
strategy that they choose and therefore the profits they have accumulated. At
the same time, after each shock an arbitrary agent has the same chance of ending
in the set of the informed as in the set of the uninformed, in such a way that
the composition of the market in each period is, on average, always the same.



^1 In the experiments presented in this paper that involve shocks, s takes a value of
unless otherwise indicated.

^2 Since the definitions of informed and uninformed are mutually exclusive, we can
consider them as states that describe the actual strategy of an agent. This way, if
an agent is informed, we can say that he is in state I or, in other words, that he is
occupying state I.

Fig. 3. Learning in markets with exogenous shocks: copycats with long-term (right)
and short-term (left) memory



We can see in this figure that copycats whose learning information set contains
only market information from the last shock until the present moment adapt
better to the new market conditions. This is especially
true in the present case, where the first shock arrives when the learning process is
almost finished; the shock resets the process but the learning
process itself stays the same. After each shock the copycats realize that they must
adjust their perceptions to the new market conditions by relearning everything.
Meanwhile, in the case of learning with long-term memory, it is much more
difficult for the copycats to identify the correct strategy after each shock, as they
are using past information that is no longer relevant. For example, they end up
copying a strategy that was useful, and therefore accumulated significant profit,
before the shock but is suboptimal after it. By the time a copycat has
realized that the strategy is no longer optimal it has made significant losses.
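
The difference between the two memory regimes comes down to the starting point of the profit window used to build the roulette. The sketch below is illustrative only: the per-tick profit lists, the strategy labels and the clamping of negative totals to zero are all assumptions made here, not details taken from the NNCP model.

```python
def accumulated_profit(per_tick, start, end):
    """Profit accumulated over ticks start..end-1 of a per-tick profit list."""
    return sum(per_tick[start:end])

def roulette_weights(profits_by_strategy, now, last_shock, short_term):
    """Normalised roulette weights for short-term or long-term memory."""
    start = last_shock if short_term else 0
    # Clamping negative totals to zero is an assumption of this sketch.
    raw = {s: max(accumulated_profit(p, start, now), 0.0)
           for s, p in profits_by_strategy.items()}
    total = sum(raw.values())
    return {s: (v / total if total else 1 / len(raw)) for s, v in raw.items()}

# Hypothetical data: "a" earns before the shock at tick 5, "b" earns after it.
profits = {"a": [2.0] * 5 + [0.0] * 5, "b": [0.0] * 5 + [1.0] * 5}
print(roulette_weights(profits, now=10, last_shock=5, short_term=True))
print(roulette_weights(profits, now=10, last_shock=5, short_term=False))
```

In this toy case the short-term wheel puts all its weight on "b", the strategy that is profitable after the shock, while the long-term wheel still gives two thirds of its weight to the obsolete strategy "a".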

4 Conclusions 

We have shown here that a fairly simple computational model can produce
interesting results. Although the model itself has little
complexity and is relatively small (a significant benefit that expresses itself in
small run times), our work on the NNCP has given us some interesting ideas
on how financial markets might deal with external information. Among the most
important findings, we can identify the role of memory during the learning
process. As shown in Section 3.1, the relevance of the information considered during the
selection of a strategy is of vital significance when it comes to optimizing
profits: using irrelevant information from before the shock to determine the optimal
post-shock strategy results in poor learning and a more efficient market.

And though the identification of shock-like structures in real financial markets
is still somewhat controversial, results like these might translate into practical
trading techniques in the future. We have also shown that learning is very much
a statistical inference process in the context of a financial market and have
exhibited some of the factors on which the efficiency of the learning depends.

There is, however, much more work ahead. The configuration of NNCP, as
used in this paper, is quite simple, so it is quite plausible that a large
collection of behaviors has been left out of the simulations. In this sense, we can consider
several improvements to the model aimed at better describing the nature of
financial markets, or at least at refining our approximations of their
description. For instance, future models might consider more complex mechanisms of
learning as well as sophisticated information shocks that do not affect the entire
market. Equally important, the participation methods of the artificial agents
could be enhanced into more than stochastic rules defined by a probabilistic
bias. Nonetheless, all these approaches demand a better understanding of the
system (from the mechanisms of learning to the role of market structure) as well
as increased computational requirements.

Abstract. Capitalizing and diffusing experience about multiagent
systems are two key mechanisms the classical approach of methods and
tools can’t address. Our hypothesis is that, among the available techniques
that collect and formalise experience, design patterns are the technique best
able to express agent concepts and to adapt to
the various MAS development problems.

In this paper, we present several agent oriented patterns, [1], in order to
demonstrate the feasibility of helping MAS analysis and design through
design patterns. Our agent patterns cover all the development stages,
from analysis to implementation, including re-engineering (through
antipatterns, [2]).



1 Introduction 

Analysis and design are becoming an important subject of research in multiagent
systems (MAS). Now that MAS are used more and more, the need for tools and
methods allowing the quick and reliable realisation of MAS is clearly appearing.
This need arises within the MAS community itself, but meeting it is also necessary to
empower a larger community to use the agent paradigm.

Present attempts of the community focus on MAS design methodology. In
fact, the development of the MAS paradigm is restricted by the gap between the apparent
ease of grasping the basic concepts (such as agent, role, organisation, or interaction) and
the difficulty of creating a MAS which resolves or helps to resolve a real world
problem. To help the MAS designer, we therefore seek a new formalisation.

As agents are often implemented by means of object programming, the
idea of using object-oriented analysis and design methods is appealing. But those
methods are not applicable, simply because objects and agents do not pertain to
the same logical and conceptual levels: for example, one can program a MAS
in a functional language. Nevertheless, we can draw our inspiration from object
methods and from their development to create methods for MAS.

Reuse has three targets: structures, processes, and concepts. Each of these
targets can be viewed at different abstraction levels; for example, structures
cover the code, the organisation of the code and the architecture of the
concepts. As shown in Table 1, among the reuse techniques,
design patterns, [1], cover all three targets and, even if the coding part is a little
less covered than with components, the method (i.e., the path from the problem
to its solution) is much better covered.

Table 1. Reuse in reusing techniques

Technique          Structures            Processes           Concepts
-----------------  --------------------  ------------------  ---------------
Duplication        code                  -                   -
Functions library  code                  algorithms          -
Classes library    code                  comportment         classes
Components         code, architecture    comportment         classes
Frameworks         code, architecture    algorithms          (classes)
Design patterns    (code), architecture  algorithms, method  classes, models
Design methods     -                     method              models


These reasons show that design patterns are well suited to resolving
MAS design problems. In fact, design patterns are the perfect means to
spread the concepts, the models, and the techniques used in MAS design
and implementation.

In the following section, in order to demonstrate the feasibility of helping
MAS analysis and design through design patterns, we present shortened
versions of various agent oriented design patterns.

2 Agent Oriented Design Patterns 

We describe here eleven agent oriented design patterns. The fields we use to
formalise each pattern are the following: its name, a synopsis, the contextual
forces of its application, real examples of its usage, the solution it proposes, some
implementation issues, an examination of its advantages and disadvantages, and
associated patterns, as design patterns are tied together (co-operation, use,
delegation, or conflict).

We classify our patterns in four categories: metapatterns, metaphoric patterns,
architectural patterns and antipatterns.

2.1 MetaPatterns 

Metapatterns are so called because they are patterns of a higher conceptual level
than other patterns. Their scope is wider: they address problems at all stages
of the design process, from analysis to implementation. Moreover, they also
cover child patterns, which are more specific patterns that use and detail
the concepts and the solution proposed by the associated metapattern.

For their formalisation, the forces field is replaced by concepts. 

Organisation schemes explains the concepts associated with organisation and 
describes their numerous uses. 

Concepts 

- Agent: an autonomous entity.

- Role: a function an agent can take in an organisation. This function can
have a short life (undertaken and abandoned shortly after) or a long life
(the function outlives the agent, as several agents succeed each other in
charge of it). Each role defines associated behaviours, interactions and relations.

- Organisation: a structure grouping several agents which undertake roles in
the organisation scheme on which this organisation is modelled.

- Organisation scheme: an abstract structure grouping role descriptions. It’s
a class of organisations.

Examples of the usage of organisation schemes are various. They range from
analysis and design, [3,4,5,6,7], to implementation, [8,9].

The Solution. For the analysis of a MAS, this pattern proposes using
organisation schemes as a reading grid: organisations in the MAS are discovered and
described by comparison with a catalog of organisation schemes ([10] could
be a sketch for such a catalog).

In MAS design, this pattern proposes to integrate and describe roles and
organisations in the UML class diagrams of the designed MAS. This integration
makes it possible to virtualize the services the roles represent: relations between roles
depend only on the roles, not on the agents undertaking the roles.

In MAS implementation, this pattern proposes to embody roles and
organisations as objects, so that agents are able to use and control them.
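
One possible way to embody roles and organisations as objects is sketched below. The class names follow the concepts of the pattern, but the fields, the methods and the auction example are hypothetical choices made for illustration; the pattern itself does not prescribe them.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Role:
    """A function an agent can take, with its associated behaviours."""
    name: str
    behaviours: tuple = ()

@dataclass
class OrganisationScheme:
    """An abstract structure grouping role descriptions."""
    name: str
    roles: dict = field(default_factory=dict)    # role name -> Role

    def add_role(self, role):
        self.roles[role.name] = role

@dataclass
class Organisation:
    """A concrete organisation modelled on a scheme."""
    scheme: OrganisationScheme
    holders: dict = field(default_factory=dict)  # role name -> agent id

    def assign(self, role_name, agent_id):
        if role_name not in self.scheme.roles:
            raise KeyError(f"{role_name!r} is not a role of {self.scheme.name!r}")
        self.holders[role_name] = agent_id

    def holder(self, role_name):
        return self.holders.get(role_name)

auction = OrganisationScheme("auction")
auction.add_role(Role("auctioneer", ("announce", "award")))
auction.add_role(Role("bidder", ("bid",)))

org = Organisation(auction)
org.assign("auctioneer", "agent-1")
org.assign("auctioneer", "agent-2")   # the role outlives its first holder
print(org.holder("auctioneer"))
```

Because other agents address the role name rather than a particular agent, the holder of a role can change without affecting the relations defined between roles.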

The Examination of this pattern shows us various advantages. Among them
is the ability this pattern gives the designer to integrate, through high level
concepts, real world knowledge into the description of the MAS he is designing.

Another advantage is the ability, by dividing the system into various
organisations, to divide and conquer the analysis and design of the system.

This pattern also has a disadvantage: it fixes the system’s organisation,
thus limiting its reactivity and adaptability by restricting its ability to
reorganise itself.

Associated patterns of this pattern are, of course, its child patterns: patterns
about specific organisation structures and models that explain specific roles, their
associated behaviours and interactions.

Protocols covers the concepts associated with interaction and communication
protocols.

Concepts 

- Agent: an autonomous entity.

- Message: an information object transmitted by one agent to another.
This concept also covers speech acts, [11].

- Interaction: a communication, perturbation or influence link between agents.
In a communication protocol, it’s a conversation.

- Role: a function an agent undertakes in an interaction. Each role defines the
messages the agent can send.

- Protocol: a set of rules allowing two or more agents to coordinate. A protocol
defines the order and the type of messages and actions.

Examples of the use of Protocols are numerous in analysis and design,
[12,13,14,15,16].



The Solution. This pattern proposes to find the messages, conversations and
interactions in the system and to abstract them into protocols. Another, complementary
way of using protocols is to use a catalog of protocols as a grid to analyse
the system (such a catalog can be written from the list of protocols the FIPA
proposes: http://www.fipa.org).
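
To make the notion of a protocol as "the order and the type of messages" concrete, the following sketch encodes a protocol as a small transition table and checks whether a conversation conforms to it. The class, the state names and the request-style example are hypothetical; the example is only loosely inspired by request-like exchanges and is not a transcription of any FIPA specification.

```python
class Protocol:
    """A protocol as a table: (state, role, message type) -> next state."""

    def __init__(self, start, transitions, final_states):
        self.start = start
        self.transitions = transitions
        self.final_states = final_states

    def accepts(self, conversation):
        """Return True if the (role, message type) sequence is a complete run."""
        state = self.start
        for role, message_type in conversation:
            key = (state, role, message_type)
            if key not in self.transitions:
                return False            # message not allowed at this point
            state = self.transitions[key]
        return state in self.final_states

# Hypothetical request-style protocol with two roles.
request = Protocol(
    start="idle",
    transitions={
        ("idle", "initiator", "request"): "asked",
        ("asked", "participant", "refuse"): "done",
        ("asked", "participant", "agree"): "working",
        ("working", "participant", "inform"): "done",
    },
    final_states={"done"},
)

ok = [("initiator", "request"), ("participant", "agree"), ("participant", "inform")]
bad = [("initiator", "request"), ("participant", "inform")]
print(request.accepts(ok), request.accepts(bad))
```

The table makes explicit which role may send which message type in each state, which is the information each role of the pattern is meant to define.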



Examination. Identifying protocols at an early stage allows a better
identification of the roles and the interactions of the system, and of the messages agents
can send.

Moreover, as design patterns already do, protocols form a common
vocabulary that enables a better analysis and a better description of the system.

Associated patterns of this one are its child patterns: patterns explaining
the application context and the constraints resolved by a particular protocol.
Organisation schemes is associated with this pattern as they both share the main
concept of role. Influences and Marks are also associated patterns:
communications between agents are a form of influence and marks are a form of message.



2.2 Metaphoric Patterns 

Metaphoric patterns are widely used in MAS. They describe the use of a solution
that is inspired by a discipline which, at first glance, seems to be totally
foreign to multi-agent systems and to their design.

As they have external origins, a new field, origins, is added to their
formalisation.



Marks is a pattern about the metaphor of pheromones, a model of communi- 
cation through the environment. 

Forces 

- Agents have limited memory capabilities.

- Agents have limited communication capacities.

- Agents are situated: there are constraints affecting their positions and their
moves.

- There are constraints affecting the resources agents can use: limited speed,
energetic autonomy, and time.

- Information that needs to be kept or shared has a spatial characteristic (it
is only locally relevant).




356 S. Sauvage 



The Origins of this metaphor are biological. The principle is especially used by
insect species. As an example, an ant deposits slight quantities of a chemical
substance (called a pheromone) which enables it to keep track of its path
(external memory) and to point out this path to its congeners (communication and
recruitment medium). Pheromones directly induce a specific behaviour in the
individual perceiving them: it’s a reaction, as in the reactive-cognitive
opposition of the agent literature. Pheromones act in the same way as the chemical
transmitters of the nervous system.

Examples. Communication by marks and especially by pheromones has been 
studied and used in various works, [17, 18, 19,20]. 

Solution. The agent deposits marks in the environment. These marks are low 
granularity objects and have a minimal size. They have some strength that 
allows the agent to sense their presence at a known distance. Once they are 
deposited, the environment takes care of their mutation: evaporation, dissem- 
ination/propagation, and perception. Those marks are perceived in a limited 
zone and in a limited way (mainly related to associated agent sensors). 

The marks allow the agents to communicate without message exchange: de- 
positing some quantity of a known mark is a modification of the environment 
every agent perceives. 

Implementation. If the goal is to have agents living in the real world (e.g. robots),
then marks are slight quantities of a product that agents are able to perceive easily and
distinctly. It’s unlikely that a chemical substance will be used, but one can
use marks which are easy to distinguish and which persist after their
emission.

If the agents live in a simulated world, then the environment has to be
pro-active to manage the marks’ life-cycle: evaporation, dissemination (due to
the wind or other agents’ moves), masking (by other marks), disappearance (even
removal).
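
For a simulated world, the environment-side life-cycle of marks can be sketched as a grid update applied at every tick. The update rule below (a fixed evaporation fraction followed by equal sharing with the four neighbouring cells) and the threshold-based perception are assumptions chosen for simplicity; they are one possible reading of "evaporation" and "dissemination", not the pattern's definition.

```python
def step_marks(grid, evaporation=0.1, diffusion=0.2):
    """Return the mark field after one environment tick.

    `grid` is a list of rows of non-negative strengths. Each cell first loses
    a fraction `evaporation`, then shares a fraction `diffusion` of what
    remains equally among its 4-neighbours (shares leaving the grid are lost).
    """
    rows, cols = len(grid), len(grid[0])
    new = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            remaining = grid[r][c] * (1.0 - evaporation)
            spread = remaining * diffusion
            new[r][c] += remaining - spread
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols:
                    new[nr][nc] += spread / 4.0
    return new

def perceive(grid, r, c, threshold=0.01):
    """Strength sensed by an agent at (r, c), or 0.0 below its sensor threshold."""
    return grid[r][c] if grid[r][c] >= threshold else 0.0

marks = [[0.0] * 3 for _ in range(3)]
marks[1][1] = 1.0                      # an agent deposits one unit in the centre
marks = step_marks(marks)
print(marks[1][1], marks[0][1], perceive(marks, 0, 0))
```

The agents only deposit and perceive; all bookkeeping is done by the environment, which is why the pattern gives the environment such a large role.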

Examination. There are also problems due to the use of this pattern. Its
implementation in the real world (that is, with robots) is not so easy: it is
difficult to find the right product, which implies difficulties with the sensors, with
their sensitivity and accuracy.

Concerning the difficulties of the simulation, the required memory and
computational resources can’t be neglected. This pattern also gives a great role to
the environment.

However, this pattern has plenty of advantages, in addition to resolving the
constraints listed in the Forces field. Marks are simple messages, as much
in their form as in their handling, and the fact that they are deposited
allows a notion of locality to be integrated into the information they carry. Moreover, this
integration is done in an indirect way for the agent: it does not need a coordinate
system, since marks are deposited where they have a meaning.

Associated patterns of this one are mainly Influences and Protocols: the
attraction/repulsion effect of marks is mainly an influence and marks are somewhat a
type of message.

Influences allow separating the causes and effects of primitive actions to overcome
conflicts between simultaneous actions.

Forces

- There are global influences and forces.

- There may be simultaneous actions.

- An agent can perturb another one without going through the second
agent’s cognitive mechanisms (no asynchronous message; physical and social
levels are distinct).

The Origins of this metaphor are in physics: influences are modelled on physical
forces. Objects (and thus agents) don’t modify each other; instead, they create
forces that, once combined, have an action.

Examples of using influences are various, [21,22,23,24]. 

The Solution proposed consists in distinguishing the action an agent applies to
another agent from the action this other agent undergoes.

The Implementation of this solution can use the algorithm defined in [22]. It
can also use gradients and a discrete space algorithm to compute them: the
wave gradient algorithm, [25].
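
The separation between the influence an agent emits and the action its target undergoes can be sketched as a two-phase update: influences are first collected and combined, and only then applied. The sketch below is a generic illustration with hypothetical names and two-dimensional displacement vectors; it is neither the algorithm of [22] nor the wave gradient algorithm of [25].

```python
def combine_influences(influences):
    """Sum the (dx, dy) influences aimed at each target."""
    net = {}
    for target, (dx, dy) in influences:
        ox, oy = net.get(target, (0.0, 0.0))
        net[target] = (ox + dx, oy + dy)
    return net

def react(positions, influences):
    """Environment step: apply each target's net influence exactly once."""
    net = combine_influences(influences)
    result = {}
    for name, (x, y) in positions.items():
        dx, dy = net.get(name, (0.0, 0.0))
        result[name] = (x + dx, y + dy)
    return result

positions = {"box": (0.0, 0.0)}
# Two agents push the same box simultaneously, in opposite x directions.
emitted = [("box", (1.0, 0.0)), ("box", (-1.0, 0.5))]
print(react(positions, emitted))
```

Because the two pushes are combined before anything moves, the outcome does not depend on the order in which the agents acted, which is how the pattern resolves conflicts between simultaneous actions.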

Examination. Conflicts between concurrent actions are frequent and difficult to handle.
Moreover, it is often convenient to have a mechanism to handle global influences.
Influences are also suitable to handle inertia, noise or friction.

The main disadvantage of this pattern is that the combination of influences
and the calculation of their effects have to be done by a mechanism that is outside
the agents. It then leads to a global action controller, which
contradicts a principle of MAS: control distribution.

Associated patterns. The Marks pattern can benefit from this pattern as a mark 
is a source of attraction/repulsion. The three antipatterns we expose below (Dis- 
cretisation, Iniquity and Physical entity) can be applied to implement this one, 
as Command and Composite can be, [1, pp. 86, 233]. 

2.3 Architectural Patterns 

Architectural patterns describe the internal architecture of an agent. As an agent
often has many tasks to accomplish and as these tasks require skills of different
cognition levels (some are reactive tasks, some are deliberations), the structure
of an agent is often decomposed into several modules. The architectural patterns
show different ways to discover and arrange these modules and their interactions.

For now, we can see four main agent architecture types: BDI, Vertical,
Horizontal, and Recursive architectures.



BDI architecture mainly embodies the BDI model used, for example, in [26,
27, 28]. The principle of this architecture rests on four knowledge bases: beliefs,
desires, intentions and plans.



Vertical architecture proposes to layer the modules in knowledge levels, each 
module having its own knowledge base, [29,30,31]. 



Horizontal architecture parallelizes the modules so they can deliberate and
act all together, each one for its own purpose and/or at its own knowledge
level, [32,33,34].



Recursive architecture sees an agent as a multiagent system. In other words, 
the modules composing the agent are seen as micro-agents acting all together in 
a micro-environment, which is the macro-agent itself, [35,36,37]. 

2.4 Antipatterns 

Antipatterns, [2], are somewhat special patterns as they don’t explain how to de- 
sign a system but how to redesign and correct common mistakes. These mistakes 
are explained in the field we called DysSolution. 



Iniquity is what happens when parallel calculus is inappropriately simulated: 
resources are not equally managed. 
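
A typical instance of Iniquity arises when agents that are conceptually parallel are activated sequentially in a fixed order, so that the first agent always reaches a shared resource first. The sketch below illustrates the problem and one common remedy, shuffling the activation order at every step; the remedy and all names are assumptions of this illustration rather than the content of the antipattern itself.

```python
import random

def run(n_agents, steps, shuffle, seed=0):
    """Each step offers one unit of resource; the first agent activated takes it."""
    rng = random.Random(seed)
    gained = [0] * n_agents
    order = list(range(n_agents))
    for _ in range(steps):
        if shuffle:
            rng.shuffle(order)        # randomise who acts first this step
        gained[order[0]] += 1
    return gained

print(run(3, 900, shuffle=False))     # fixed order: agent 0 takes everything
print(run(3, 900, shuffle=True))      # shuffled order: roughly equal shares
```

With a fixed order the simulation artefact, not the agents' behaviour, decides who gets the resource; randomising the order removes that systematic bias on average.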



Discretisation is a partial loss of information, especially with numerical data. 



Physical entity is to be used when, for practical reasons, the agent handles
the physical actions applied to itself with its deliberative modules. That is, when
there is no separation between the agent’s physical part and its rational part.

3 Conclusions 

Different works about agent patterns exist. Some are short or more object-like,
[38,39,40,41]. Some are interesting, such as Aridor’s, Kendall’s, Deugo’s, or Aarsten’s
[42,43,44,45,46], but it is important not to limit the application of
the pattern technique in MAS to the simple use of patterns in the particular domain that
MAS are. Indeed, some papers submit patterns presenting agents and agent
techniques in an object-oriented way, i.e. as object techniques, leaving out the
fact that agents are much more than objects: they have autonomy, finality, interaction,
and they form a multiagent system.

That’s why the eleven agent oriented patterns, of which we have just presented
shortened versions, are of a higher abstraction level than object oriented design
patterns usually are.

We think that patterns can help to develop MAS, as much in analysis, design
or implementation as in teaching and spreading the agent paradigm.
Presenting the models and concepts of the MAS paradigm as patterns, as well as the
techniques used to implement them, would enable us to structure, spread and
constructively extend the knowledge of the agent paradigm we now have. Indeed,
beyond their uniform structure, patterns make the concepts easier to understand,
because they emerge from experience and widely used examples, but also
because they integrate theory with examples and explain the conceptual reasons
behind implementation techniques.

For now, our main goal is to present our agent patterns, as in [47], submit 
them to discussion and enhancement, and show their use through the development 
of MAS. 

Abstract. In combinatorial auctions, a bidder may bid for arbitrary 
combinations of items, so combinatorial auctions can be applied to resource 
and task allocations in multiagent systems. But determining the winners of 
a combinatorial auction who maximize the profit of the auctioneer is known 
to be NP-complete. A branch-and-bound method is one of the efficient 
methods for winner determination. 

In this paper, we propose a faster winner determination algorithm in 
combinatorial auctions. The proposed algorithm uses both a branch-and- 
bound method and Linear Programming. We present a new heuristic bid 
selection method for the algorithm. In addition, the upper-bounds are 
reused to reduce the running time of the algorithm in some specific cases. 
We evaluate the performance of the proposed algorithm by comparing it 
with those of CPLEX and a known method. The experiments have been 
conducted with six datasets, each of which has a different distribution. 
The proposed algorithm has shown superior efficiency in three datasets 
and similar efficiency in the rest of the datasets. 



1 Introduction 

Combinatorial auctions (CAs) allow a bidder to submit a bid on a combination 
of distinguishable items. However, determining the bidders in CAs whose 
bids give the auctioneer the highest profit is NP-complete [11]. The problem of 
determining such bidders is known as the winner determination problem. Solving 
this problem is applicable to various practical auctions for airport landing 
slots, transportation exchanges, spectrum licenses, pollution permits, 
computational resources, and so on [2] [3] [7] [8]. 

There are two important characteristics of bids in CAs. One is complementarity: 
the valuation of a set of items that a bidder wants is more than the sum of the 
valuations of the individual items. The other is substitutability: the valuation 
of a set of items that a bidder wants is less than the sum of the valuations of 
the individual items. So CAs can be applied to resource and task allocations in 
multiagent systems in which items have the above characteristics. 

The winner determination problem can be defined formally as follows. 
Let there be n bidders and m items. We denote the set of bidders as 
$B = \{b_1, b_2, \ldots, b_n\}$ and the set of items as $S = \{1, 2, \ldots, m\}$. 
Let $b_i = (s_i, p_i)$ be the bid for a bidder $b_i$, where $s_i$ is a nonempty subset 
of $S$ and $p_i$ is the price that $b_i$ will pay for the items in $s_i$. Then the 
winners are determined with the following equation under the condition that each 
item can be allocated to at most one bidder, where $x_i = 1$ if $b_i$ is a winner 
and $x_i = 0$ otherwise, for $i = 1, 2, \ldots, n$. 

$$\max \sum_{i=1}^{n} x_i p_i \quad \text{s.t.} \quad \sum_{i \mid j \in s_i} x_i \le 1, \ \forall j \in S \tag{1}$$

$$x_i = 0 \text{ or } 1, \quad \text{for } i = 1, 2, \ldots, n.$$

When the items are allowed to be allocated partially to the winners, Equation 
(1) can be converted into the following equation; that is, the winner 
determination problem can be relaxed to a linear programming (LP) problem. 

$$\max \sum_{i=1}^{n} x_i p_i \quad \text{s.t.} \quad \sum_{i \mid j \in s_i} x_i \le 1, \ \forall j \in S \tag{2}$$

$$0 \le x_i \le 1, \quad \text{for } i = 1, 2, \ldots, n.$$

The above equation helps a branch-and-bound method with pruning the search 
space to reduce the search time [9]. 
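
To make Equation (1) concrete, the following minimal sketch enumerates every subset of bids on a tiny instance and keeps the best feasible one. It is an illustration only, not the algorithm proposed in this paper: the function name `solve_wdp_bruteforce` and the four example bids are hypothetical, and exhaustive enumeration is usable only for a handful of bids.

```python
from itertools import combinations

def solve_wdp_bruteforce(bids):
    """Exhaustively solve Equation (1) for a small list of (items, price) bids."""
    best_value, best_winners = 0.0, ()
    n = len(bids)
    for r in range(1, n + 1):
        for subset in combinations(range(n), r):
            allocated = set()
            feasible = True
            for i in subset:
                items = bids[i][0]
                if allocated & items:   # some item would go to two bidders
                    feasible = False
                    break
                allocated |= items
            if feasible:
                value = sum(bids[i][1] for i in subset)
                if value > best_value:
                    best_value, best_winners = value, subset
    return best_value, best_winners

# Hypothetical instance: items 1..3 and four bids (item set, price).
example_bids = [
    (frozenset({1, 2}), 10.0),
    (frozenset({2, 3}), 8.0),
    (frozenset({3}), 5.0),
    (frozenset({1}), 4.0),
]
best_value, best_winners = solve_wdp_bruteforce(example_bids)
```

On this instance the first and third bids win together, since they share no item and their prices sum to more than any other feasible combination.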

Andersson et al. [1] showed that the winner determination problem can be 
reduced to a mixed integer programming (MIP) problem. They used CPLEX to solve 
the MIP. Sandholm et al. [12] presented an optimal allocation algorithm called 
CABOB, in which a branch-and-bound method is used to guarantee the optimal 
solutions, LP is employed to get the upper-bounds, and some heuristic methods are 
proposed to improve its performance. BOB [15] is the original model of CABOB, 
and introduced various methods that appear in CABOB. CASS [5] constructs a data 
structure called BIN to exploit the characteristics of CAs and explores the search 
space with DFS for the optimal solution. In [5], [10], and [13], several 
approximation algorithms for the problem are given. [9], [11], and [13] provided 
some other methods to limit the bids. 

In this paper we propose a faster optimal allocation algorithm for CAs that 
uses both a branch-and-bound method and LP. In the proposed algorithm we 
introduce a new heuristic bid selection method. It also reuses the upper-bounds 
to reduce the computation time. The experiments have been conducted with 
six datasets as in [12] each of which has a different distribution. The proposed 
algorithm has shown better performance than CPLEX and CABOB for three 
datasets and showed similar performance for the rest of the datasets. 

The rest of this paper is organized as follows. Section 2 describes the 
proposed algorithm in detail. Experimental results are given in Section 3. Finally, 
conclusions are drawn in Section 4. 

2 The Proposed Algorithm 

The proposed algorithm exploits a branch-and-bound method with BFS 
(best-first search) to traverse the search space globally, while it uses DFS 
(depth-first search) locally within a portion of the search space in which only a 
certain number of remaining bids are left to be searched. We introduce a 
hybrid search of BFS and DFS because it avoids the memory shortage that occurs 
when only BFS is used and the longer search time that results when only DFS is 
deployed. 



Algorithm 2.1 

 1. PriorityQueue Q; 
 2. Node u, v; Opt_value = 0; 
 3. If (the input is the complete case) then 
 4.     return the winner from the complete case; 
 5. Initialize v with the initial bid; 
 6. Insert v into Q; 
 7. While (Q is not empty) do 
 8.     Dequeue v from Q; 
 9.     Upper_bound = Bound(v); // compute the upper-bound with LP 
10.     If (Upper_bound > Opt_value) then 
11.         If (the number of remaining bids < 10% of the total number of bids) then 
12.             Search the remaining bids with DFS; 
13.         else 
14.             If (v is the integer case) then solve integer programming; 
15.             Select a bid b with the proposed heuristic bid selection method; 
16.             Create nodes u1 and u2 such that u1 includes b and u2 does not; 
17.             Opt_value = max{Opt_value, value(u1), value(u2)}; 
18.             If (Bound(ui) > Opt_value) then insert ui into Q, for i = 1, 2; 
19. end while 

The proposed algorithm first checks if the input is the complete case, in which 
there is only one winner and the bids from the rest of the bidders conflict with 
each other. If so, we terminate the algorithm with the winner (Line 4). Otherwise, 
we continue to the next step of the algorithm. There is another special case called 
the integer case in which, for each $b_i$, Equation (1) results in either $x_i = 0$ 
or $x_i = 1$. 

Then we initialize v with the initial bids and insert v into Q. Note that a 
node v holds the bids selected so far, the rejected bids, the upper-bound, the x 
values, and value(v), which is the sum of the prices of the selected bids. Next, the 
algorithm performs the while-loop as long as Q is not empty. Within the while-loop, 
node v is dequeued from Q (Line 8). Bound(v) returns the upper-bound for v 
using LP with Equation (2). If Bound(v) is greater than Opt_value, which is the 
optimal value found so far, we check if the number of remaining bids is less than 
10% of the total number of bids. If so, we perform DFS for the rest of the bids 
(Line 12). During DFS, we calculate the upper-bound of each node and update 
Opt_value if the upper-bound is greater than Opt_value. Otherwise, the algorithm 
carries out BFS. Although BFS takes less time, it needs an exponential 
amount of memory, whereas DFS takes longer but uses less memory. Hence we 
used DFS along with BFS, and we need to find a proper percentage of BFS among 
the entire search to save memory without affecting the execution time. We have 
tested various percentages of BFS from 5% to 30%, incrementing by 5% at a time. 
The test results have shown that we obtained better results when 10% of the search 
is done by BFS. Before branching from v, if v is the integer case, we solve integer 
programming to obtain value(v). Otherwise, we select a new bid from the rest of the 
bids using the proposed heuristic bid selection method (Line 15). After creating u1 
and u2 (Line 16), we find the upper-bound of each to see if it is greater than 
Opt_value. If so, we insert it into Q for the next iteration of the while-loop. 
The loop is terminated when Q is empty. In the rest of this section, we describe 
a new heuristic bid selection method and explain the reuse of the upper-bounds. 
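
The control flow of Algorithm 2.1 can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the implementation evaluated in this paper: the LP bound of Equation (2) is replaced by a looser optimistic bound (the sum of all remaining prices), bids are branched on in order of decreasing price instead of with the proposed heuristic, and the complete and integer cases are omitted. The names `bound`, `dfs`, `max_revenue` and `dfs_fraction` are hypothetical.

```python
import heapq
from itertools import count

def bound(value, remaining, bids):
    # Optimistic stand-in for the LP bound: take every remaining bid, ignoring conflicts.
    return value + sum(bids[i][1] for i in remaining)

def dfs(value, remaining, bids, best):
    # Depth-first search over the remaining bids with the same pruning rule.
    best = max(best, value)
    if not remaining or bound(value, remaining, bids) <= best:
        return best
    b, rest = remaining[0], remaining[1:]
    compatible = [i for i in rest if not (bids[i][0] & bids[b][0])]
    best = dfs(value + bids[b][1], compatible, bids, best)  # branch that includes b
    return dfs(value, rest, bids, best)                     # branch that excludes b

def max_revenue(bids, dfs_fraction=0.10):
    n = len(bids)
    order = sorted(range(n), key=lambda i: -bids[i][1])  # placeholder selection rule
    best, tie = 0.0, count()
    queue = [(-bound(0.0, order, bids), next(tie), 0.0, order)]
    while queue:
        neg_bound, _, value, remaining = heapq.heappop(queue)
        if -neg_bound <= best:
            continue                                     # cannot beat the incumbent
        if len(remaining) < dfs_fraction * n:
            best = dfs(value, remaining, bids, best)     # few bids left: switch to DFS
            continue
        b, rest = remaining[0], remaining[1:]
        compatible = [i for i in rest if not (bids[i][0] & bids[b][0])]
        children = [(value + bids[b][1], compatible), (value, rest)]
        best = max(best, *(v for v, _ in children))
        for child_value, child_remaining in children:
            child_bound = bound(child_value, child_remaining, bids)
            if child_bound > best:
                heapq.heappush(queue, (-child_bound, next(tie), child_value, child_remaining))
    return best

# Hypothetical instance: items 1..3 and four bids (item set, price).
example_bids = [
    (frozenset({1, 2}), 10.0),
    (frozenset({2, 3}), 8.0),
    (frozenset({3}), 5.0),
    (frozenset({1}), 4.0),
]
best_first_result = max_revenue(example_bids)
depth_first_result = max_revenue(example_bids, dfs_fraction=1.1)  # forces the DFS path
```

The priority queue is keyed on the negated bound so that the node with the highest upper-bound is expanded first, and the counter only breaks ties between equal bounds.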



2.1 A New Heuristic Method for Bid Selection 

We propose a new heuristic bid selection method in which the bids that are not 
selected are searched with a branch-and-bound method in order to select the bids 
that may belong to the optimal solution. Some heuristic bid selection methods 
that appeared in previous work utilize the coefficients of LP or the information 
(bid prices and the number of items) submitted by bidders [6] [9] [12]. Another 
method uses a graph, in which each node represents a bid and each edge links 
two bids that bid on the same item, and utilizes the degrees of the nodes in 
selecting the bids [12]. 

In this paper we use both the coefficients of LP and a bid graph approach. 
Let $c_{ij} = 0$ if $i = j$ or if $b_i$ and $b_j$ bid on different items, and 
$c_{ij} = 1$ if $b_i$ and $b_j$ bid on at least one common item; in the latter case 
we say that $b_i$ and $b_j$ conflict with each other. 

We now describe the proposed heuristic method for bid selection, called Conflict 
Bids Sum (CBS). For $b_i$, if $x_i \ge 0.5$, we obtain $cbs_i$ by summing up the x 
values of all the bids conflicting with $b_i$. Otherwise, $cbs_i = x_i$. CBS selects 
the bid with the highest $cbs_i$. 

$$cbs_i = \begin{cases} \sum_{j=1}^{n} c_{ij} x_j, & x_i \ge 0.5 \\ x_i, & \text{otherwise} \end{cases} \tag{3}$$



CBS treats the bids with their x values greater than or equal to 0.5 as if they 
have the same x value and uses the sum of the x values of the conflicting bids 
against them for determining the priority in selection. Therefore, even though 
bi has a higher x value, bid bi is not likely to be selected if cbsi is small. CBS 
tends to select bids with both higher x values and higher cbs values. CBS selects 
first the bids with values close to 0.5 and then selects the bids with indecisive x 
values so that CBS can reach the integer cases faster. 
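
The selection rule of Equation (3) can be sketched as below. This is a minimal illustration: the function names `cbs_scores` and `select_bid`, the four bids and their x values are all hypothetical, and the x values are simply given rather than computed by an LP solver.

```python
def cbs_scores(bids, x):
    """Conflict Bids Sum of Equation (3) for every bid, given the LP values x."""
    scores = []
    for i, (items_i, _) in enumerate(bids):
        if x[i] >= 0.5:
            # Sum the x values of all bids sharing at least one item with bid i.
            score = sum(
                x[j]
                for j, (items_j, _) in enumerate(bids)
                if j != i and items_i & items_j
            )
        else:
            score = x[i]
        scores.append(score)
    return scores

def select_bid(bids, x):
    """Return the index of the bid with the highest cbs value."""
    scores = cbs_scores(bids, x)
    return max(range(len(bids)), key=lambda i: scores[i])

# Hypothetical bids (item set, price) and hypothetical LP values.
example_bids = [
    (frozenset({1, 2}), 10.0),
    (frozenset({2, 3}), 6.0),
    (frozenset({1, 4}), 9.0),
    (frozenset({4, 5}), 7.0),
]
example_x = [0.5, 0.25, 0.5, 0.5]
scores = cbs_scores(example_bids, example_x)
branch_on = select_bid(example_bids, example_x)
```

Here the third bid is chosen: its x value is at least 0.5 and its two conflicting bids carry the largest total x value.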

Fig. 1 shows an example of a bid graph. For each bi, we can obtain its cbsi 
using Equation (3): 

$cbs_1\,(x_1 < 0.5) = x_1 = 0.0$, 
$cbs_2\,(x_2 \ge 0.5) = x_1 + x_3 + x_5 = 0.0 + 0.3 + 0.0 = 0.3$, 
$cbs_3\,(x_3 < 0.5) = x_3 = 0.3$, 
$cbs_4\,(x_4 \ge 0.5) = 0.0$, 
$cbs_5\,(x_5 < 0.5) = x_5 = 0.0$, 
$cbs_6\,(x_6 < 0.5) = x_6 = 0.2$, 
$cbs_7\,(x_7 \ge 0.5) = x_3 + x_6 = 0.3 + 0.2 = 0.5$. 
In this example, $b_7$ is selected as the bid to branch on, since it has the highest 
cbs value. 

Fig. 1. An example of a bid graph 



2.2 The Reuse of the Upper-Bounds 

There are two cases in searching with a branch-and-bound method; that is, 
branching after either including the bid selected by the bid selection method 
in the solution set or excluding it (Line 16 of Algorithm 2.1). In both cases, we 
may reuse the previous upper-bounds without recalculating the upper-bounds with LP. 

• Case 1: (Including the bid selected by the bid selection method) When the x 
value of the selected bid is 1, the upper-bound at the time of branching is the 
same as the upper-bound before branching minus the price of the selected 
bid, because the conflicting bids all have x values of zero and neither they 
nor the selected bid are considered when computing the upper-bound at the 
time of branching. 

• Case 2: (Excluding the bid selected by the bid selection method) The 
conflicting bids against the selected bid are not excluded from the candidate 
bids. If the x value of the selected bid is 0, the upper-bound for the remaining 
bids is the same as that before branching. So we do not calculate the 
upper-bound again [6]. 
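
The two cases above amount to a small check that can be made before calling the LP solver. The sketch below is an assumption about how such a check might look; the function name `reused_upper_bound`, its parameters and the tolerance `tol` are hypothetical.

```python
def reused_upper_bound(parent_bound, x_selected, price, include, tol=1e-9):
    """Return the child's upper-bound if it follows from the parent's, else None."""
    if include and abs(x_selected - 1.0) <= tol:
        # Case 1: the bound for the remaining bids is the parent bound minus the price.
        return parent_bound - price
    if not include and abs(x_selected) <= tol:
        # Case 2: excluding a bid with x = 0 leaves the bound unchanged.
        return parent_bound
    return None  # neither case applies, so the LP has to be solved again

case_1 = reused_upper_bound(100.0, 1.0, 30.0, include=True)
case_2 = reused_upper_bound(100.0, 0.0, 30.0, include=False)
needs_lp = reused_upper_bound(100.0, 0.4, 30.0, include=True)
```

A `None` result signals that the bound must be recomputed with LP, so the check never changes the search result and only saves solver calls.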

3 Experimental Results 

The experiments have been conducted on a PC with a Pentium IV 1.0GHz processor 
and 512MB of memory running Windows 2000. We used CPLEX version 7.0. 
The input datasets have been created according to Sandholm et al. [12]. There are 
six different datasets: random, weighted random, decay, uniform, bounded-low 
and bounded-high. Except for the uniform dataset, the number of items in 
each dataset is one tenth of the number of bids. The following describes each 
dataset: 

• Random dataset: Select the items randomly from m items. Although there 
may be some duplicate items among the selected items, we do not select 
replacement items for them. A price is created by choosing a real number 
between 0 and 1 and multiplying it by 10,000. 

• Weighted random dataset: Select the items randomly as in the random 
dataset. A price is created by choosing a real number between 0 and the 
number of selected items and multiplying it by 10,000. 

• Decay dataset: Select one item randomly from m items and then select another 
item with probability α repeatedly until no further items can be selected. 

• Uniform dataset: For each bid, select the same number of items randomly. A 
price is created by choosing a real number between 0 and 1 and multiplying 
it by 10,000. The number of bids for the experiment is set to 450. 

• Bounded-low dataset: Select k items, where k is chosen randomly between 
the lower and upper bounds. A price is created by choosing a real number 
between 0 and the number of selected items and multiplying it by 10,000. The 
lower and upper bounds of the bounded-low dataset are 0 and 5, respectively. 

• Bounded-high dataset: Select the items and determine the price as in the 
bounded-low dataset. The lower and upper bounds of the bounded-high 
dataset are 20 and 25, respectively. 
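
As an illustration of the descriptions above, the following sketch generates single bids for three of the distributions. It rests on several assumptions that the text does not settle: items are drawn without repetition, a bounded bid always contains at least one item even when the lower bound is 0, and no price is generated for the decay distribution because its pricing rule is not stated here. All function names are hypothetical, and this is not the generator of [12].

```python
import random

def scaled_price(upper, rng):
    # A real number between 0 and `upper`, multiplied by 10,000.
    return rng.uniform(0.0, upper) * 10_000

def uniform_bid(m, k, rng):
    # Every bid gets the same number k of randomly chosen items.
    items = frozenset(rng.sample(range(1, m + 1), k))
    return items, scaled_price(1.0, rng)

def bounded_bid(m, low, high, rng):
    k = max(1, rng.randint(low, high))  # assumption: a bid is never empty
    items = frozenset(rng.sample(range(1, m + 1), k))
    return items, scaled_price(len(items), rng)

def decay_items(m, alpha, rng):
    # Start with one random item, then keep adding items with probability alpha.
    pool = list(range(1, m + 1))
    rng.shuffle(pool)
    items = [pool.pop()]
    while pool and rng.random() < alpha:
        items.append(pool.pop())
    return frozenset(items)

rng = random.Random(0)  # fixed seed so the sketch is reproducible
uniform_items, uniform_price = uniform_bid(45, 3, rng)
low_items, low_price = bounded_bid(45, 0, 5, rng)
high_items, high_price = bounded_bid(45, 20, 25, rng)
decay_set = decay_items(45, 0.55, rng)
```

The values 45, 3 and 0.55 are placeholders chosen for the example, not parameters reported in the experiments.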

We ran our proposed algorithm and CPLEX (version 7.0) for the experiments. 
Note that we took the results of CABOB from [12] rather than from our own 
implementation for the sake of fairness in comparison, since many factors and 
parameters needed to implement CABOB are unknown to us. CABOB was 
implemented on a Pentium III-933MHz with 512MB of memory. In an effort to 
compare CABOB with our algorithm and CPLEX fairly, we used the same datasets 
as those on which CABOB was tested. We have also compared the results of the 
algorithm using only DFS with the others. 

Before we show the experimental results, we introduce a preprocessing of the 
datasets to speed up the entire processing. In the preprocessing, we filter out the 
bids that can never be the winners. For example, we can get rid of all the bids 
whose prices are lower than the highest price of a bid on the same item set. We 
can also remove more bids in the following situation. Assume that $b_i$ and $b_j$ 
want the item sets $s_i$ and $s_j$, and their prices are $p_i$ and $p_j$, respectively. 
If $s_i \subseteq s_j$ and $p_i > p_j$, then $b_j$ can be removed in the preprocessing. 
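
The dominance rule just described can be sketched as a quadratic-time filter. The function name `remove_dominated` and the example bids are hypothetical, and the sketch makes no claim about how the preprocessing was implemented for the experiments.

```python
def remove_dominated(bids):
    """Drop every bid b_j for which some b_i has s_i within s_j and p_i > p_j."""
    kept = []
    for j, (items_j, price_j) in enumerate(bids):
        dominated = any(
            items_i <= items_j and price_i > price_j  # <= is the subset test on sets
            for i, (items_i, price_i) in enumerate(bids)
            if i != j
        )
        if not dominated:
            kept.append((items_j, price_j))
    return kept

# Hypothetical bids (item set, price).
example_bids = [
    (frozenset({1, 2}), 10.0),
    (frozenset({1, 2}), 7.0),     # same items as the first bid, lower price
    (frozenset({1, 2, 3}), 9.0),  # asks for more items than the first bid, pays less
    (frozenset({3}), 5.0),
]
filtered = remove_dominated(example_bids)
```

Because the price comparison is strict, two bids on the same item set with equal prices are both kept.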

Table 1 shows the average percentage of bids removed in the preprocessing 
for each dataset. We could remove a large proportion of the bids from the random 
and the decay datasets. However, we exclude the time for preprocessing in 
measuring the overall processing time for each dataset, as CABOB did. 

Each point in Fig. 2, Fig. 3 and Fig. 4 shows the average of the results 
on 100 different test data for the same number of bids. The proposed algorithm 
has shown better performance than both CPLEX and CABOB except for the 
weighted random and the decay datasets. But observe that the differences among 
these algorithms are not large, especially for the weighted random dataset. 



Table 1. The average percentage of bids removed in the preprocessing 

Dataset            Avg. percentage of bids removed 
Random             79.72 
Weighted Random    12.27 
Decay              61.92 
Uniform             0.17 
Bound               5.98 

Fig. 2. Random and Weighted random datasets 

Fig. 3. Decay and uniform datasets 

Fig. 4. Bounded-low and bounded-high datasets 

4 Conclusion 

Multiagent systems need efficient resource and task allocations for 
complementary and substitutable items. Combinatorial auctions can be employed to 
satisfy these needs. In this paper, we have proposed a faster allocation algorithm 
for winner determination in combinatorial auctions. The proposed algorithm utilizes 
a branch-and-bound method and LP to search for the optimal solution. In this 
algorithm we have introduced a new heuristic bid selection method that 
considers not only the x value of each bid but also those of the conflicting bids at 
the same time. The experimental results show that the proposed allocation 
algorithm performed better than CPLEX and CABOB for the random, 
the uniform and the bounded-low datasets and showed similar performance for the 
rest of the datasets. 

Abstract. The neuronal regulators of biological systems are very difficult to 
deal with since they present unstructured problems. The agent paradigm can 
analyze this type of system in a simple way. In this paper, a formal agent-based 
framework that incorporates aspects such as modularity, flexibility and 
scalability is presented. Moreover, it enables the modeling of systems that 
present distribution and emergence characteristics. The proposed framework 
provides a definition of a model for the neuronal regulator of the lower urinary 
tract. Several experiments have been carried out using the model 
as presented, and the results have been validated by comparing them with real 
data. The developed simulator can be used by specialists in research tasks, in 
hospitals and in the field of education. 



1 Introduction 

At present, the theories developed to control complex systems that incorporate 
nonlinear aspects and present unknown parameters are optimal control, 
adaptive control and robust control [1]. The common characteristic of these 
theories is their sound mathematical basis. 

The neuronal regulators of biological systems cannot be described merely by 
means of mathematical models, because the data available are incomplete 
and vague and, besides, most of the information is qualitative. The situation gets 
still more complex if distributed systems and emergent behaviour intervene, as is 
the case with the neuronal regulators. It is in this context that the agent paradigm 
adds a greater level of abstraction, allowing the solution of complex problems to be 
reached in a simpler way. 

Several mathematical models of the lower urinary tract have been published [2], 
[3], [4], [5]. They focus on solving the problem from a global approach. In the 
present study, we deal with the problem from a distributed viewpoint, with emergent 
characteristics. We consequently define an agent-based framework which will 
enable us to model the neuronal regulator of the lower urinary tract, showing a 
particular structure and performance. 

Multiagent systems provide a paradigm with sufficient expressive capacity to 
tackle the development of such distributed systems, accounting for a wide range of 
contingencies, emergent behaviour, and the possibility of structure 
modification as new advances are made in neurological research. 

In the following section, we present an agent-based formal framework which will 
enable us to define the model of the neuronal regulator of the biological system. In 
section 3, we define the model once the characteristics are presented and 
the functioning of the system is explained. The results are included in section 4, and 
in section 5 we draw our conclusions and offer an outline of the different areas of 
work in which we are currently involved. 



2 The Framework 

At a general level, we assume that a biological system is made up of a mechanical 
system (MS), a neuronal regulator system (NRS) that controls the mechanical part and 
an interface communicating both systems. Formally, 

$$\mathit{Biological\_System} = \langle MS, NRS, \mathit{Interface} \rangle \tag{1}$$

Next we define each one of the elements that forms the biological system. 



2.1 Interface of the Systems 

The interface regards the biological system as a system of actions and reactions, 
using the following structure: 

$$\mathit{Interface} = \langle \Sigma, \Gamma, P \rangle \tag{2}$$

where $\Sigma$ represents the set of possible states of the system and $\Gamma$ 
identifies the set of possible intentions of actions on the system. The entities do 
not have overall control of the system and they also have to combine their objectives 
with those of other entities, so the result of each action is represented as an 
intention of action on the system. Finally, $P$ is the set of all the possible actions 
that the entities can perform on the system. 

The states of the system ($\Sigma$) can be expressed by the values of the different 
sensor and actuator signals that act as an interface with the other systems. Each 
state $\sigma_i \in \Sigma$ is defined as a list of pairs (signal, valueSignal): 

$$\sigma_i = ((sig_1, val_1), (sig_2, val_2), \ldots, (sig_n, val_n)) \tag{3}$$

On the other hand, the system can change in response to the different actions 
performed on it. The set of the possible influences, or attempts of action, of the 
different entities with reference to the present state of the system is defined as: 

$$\Gamma = \{\gamma_1, \gamma_2, \ldots, \gamma_n\} \tag{4}$$

in which each $\gamma_i$ is a list of pairs formed by an element together with its 
value. 

The entities must carry out actions to be able to act on the system. The set of all 
the possible actions that can be performed in a certain system can be defined as: 

$$P = \{p_1, p_2, \ldots, p_n\} \tag{5}$$

Each action ($p_i$) is defined in terms of a name, a precondition that describes the 
conditions that must hold for the action to be executed, and a postcondition that 
makes inferences on the set of influences that converge on the action being executed. 

2.2 Entities of the Neuronal Regulator System (NRS) 

The entities of the neuronal regulator system (NRS) are modeled as cognitive agents 
that present a PDE (Perception-Deliberation-Execution) architecture [6] with a 
modified execution function [7]. The capacity to memorize has been incorporated 
into those agents in order to obtain a richer and more powerful deliberation: 

$$NRS = \langle \alpha_1, \alpha_2, \ldots, \alpha_n \rangle \tag{6}$$

An agent $\alpha \in NRS$ is formally described using the structure: 

$$\alpha = \langle \Phi_\alpha, S_\alpha, Percept_\alpha, Mem_\alpha, Decision_\alpha, Exec_\alpha \rangle \tag{7}$$

where $\Phi_\alpha$ corresponds to the set of perceptions, $S_\alpha$ to the set of 
internal states, $Percept_\alpha$ provides the centre with information about the state 
of the system, $Mem_\alpha$ allows the centre to show awareness of the state, 
$Decision_\alpha$ selects the next influence, and $Exec_\alpha$ represents the agent's 
intention of acting on the system. These functions present a general structure that 
depends on each agent's specified sets and functions [8]. The internal structure of 
an agent can be seen in Fig. 1. 

For an agent, perception is the ability to classify and distinguish states of the 
system. The perception is defined as a function that associates a set of values, 
called perceptions or stimuli, with a set of states of the system: 

$$Percept_\alpha : \Sigma \rightarrow \Phi_\alpha \tag{8}$$

The set of the possible perceptions associated with the agent is defined as: 

$$\Phi_\alpha = \langle \phi_1, \phi_2, \ldots, \phi_n \rangle \tag{9}$$

where each $\phi_i$ is a structure composed of a list of pairs formed by an element 
and its value, corresponding to the state of the system previously defined. 

Each agent has an internal state that confers on it the capacity to memorize and to 
develop a complex behavior. The set of internal states of a certain agent is defined 
as: 

$$S_\alpha = \{s_1, s_2, \ldots, s_p\} \tag{10}$$

On the other hand, the decision function associates an action with a perception in 
a given internal state of the agent: 

$$Decision_\alpha : \Phi_\alpha \times S_\alpha \rightarrow P \tag{11}$$

The decision function depends on the precondition decision function 
($PreD_\alpha(\phi, s)$), which relates a true or false value to a perception in a 
given internal state, and on a function ($FunD_\alpha(\phi, s)$) that associates a list 
of actuator signals and their new values with the perception the agent has acquired. 

The memorization of information happens when switching to another internal 
state; that is, the memorization function relates an internal state of the agent to a 
perception in a given internal state: 

$$Mem_\alpha : \Phi_\alpha \times S_\alpha \rightarrow S_\alpha \tag{12}$$

The memorization function depends on a precondition memorization function 
($PreM_\alpha(\phi, s)$), which relates a true or false value to a perception in a 
particular internal state, and on a function ($FunD_\alpha(\phi, s)$) that associates a 
new internal state with a perception in a given internal state. 

Once the agent has decided what action to take, it must execute it. The actions on 
the system are carried out by means of the execution function defined as 

$$Exec_\alpha : P \times \Phi_\alpha \rightarrow \Gamma \tag{13}$$

where $P$ is the set of actions that can be performed on the system, $\Phi_\alpha$ is 
the set of the possible perceptions that the agent $\alpha$ can have of the system and 
$\Gamma$ is the set of influences of the different agents. 

Fig. 1. Representation of the system. Agents perceive how the world is from the state 
of the system ($\sigma(t)$). With the perception ($\phi_i(t)$) and the internal state 
($s_i(t)$), an agent will change to another internal state ($s_i(t+1)$) and will decide 
what action to take ($p$). The execution of that action will generate an influence 
($\gamma_i$) that will try to act on the system 



2.3 Mechanical System (MS) 

This paper focuses on the neuronal regulator of a biological system. However, since 
the mechanical system (MS) and the neuronal regulator system are so closely 
connected, we introduce the function of the mechanical system in this context: 
it generates afferent signals from a given set of efferent signals. This function 
complies with the dynamics of the mechanical system and is carried out through 
the accomplishment of actions with the aim of transforming one state into another. 
This change is regarded as the reaction of the system to the different influences. 

Function MS provides the information about the current state of the system 
subjected to the influences from different entities: 

$$MS : \Sigma \times \Gamma \rightarrow \Sigma \tag{14}$$

Taking into account the above definition of an agent, and the perception of the 
environment at a certain point, the new state of the system results from the 
assessment of the influences from the different agents when they are concurrently 
performing their tasks: 

$$\sigma(t+1) = MS\Big(\sigma(t), \prod_{i=1}^{n} Exec_i\big(Decision_i(\phi_i(t), s_i(t)), \phi_i(t)\big)\Big)$$

$$s_1(t+1) = Mem_1(\phi_1(t), s_1(t)); \ \ldots; \ s_n(t+1) = Mem_n(\phi_n(t), s_n(t)) \tag{15}$$

Fig. 1 shows graphically the structure of the system, where the agents and their 
relations are established. 
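
One update cycle of the framework (perceive, decide, execute, memorize, then let the mechanical system react) can be sketched in Python as follows. This is a toy illustration under loud assumptions: states and influences are represented as dictionaries from signal names to values, the mechanical system simply overwrites signals with the requested values, and the single agent, its signal names and its threshold of 10.0 are invented for the example and have no physiological meaning. The names `Agent`, `step` and `merge_influences` are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

State = Dict[str, float]      # a state sigma: signal name -> value
Influence = Dict[str, float]  # an influence gamma: signal name -> desired value

@dataclass
class Agent:
    percept: Callable[[State], State]           # Percept: Sigma -> Phi
    decision: Callable[[State, str], str]       # Decision: Phi x S -> P
    mem: Callable[[State, str], str]            # Mem: Phi x S -> S
    execute: Callable[[str, State], Influence]  # Exec: P x Phi -> Gamma
    internal: str = "idle"                      # current internal state s(t)

def step(state: State, agents: List[Agent],
         ms: Callable[[State, List[Influence]], State]) -> State:
    """Compute sigma(t+1) and update every agent's internal state."""
    influences, next_internal = [], []
    for agent in agents:
        phi = agent.percept(state)
        action = agent.decision(phi, agent.internal)
        influences.append(agent.execute(action, phi))
        next_internal.append(agent.mem(phi, agent.internal))
    for agent, new_internal in zip(agents, next_internal):
        agent.internal = new_internal
    return ms(state, influences)

def merge_influences(state: State, influences: List[Influence]) -> State:
    # Toy mechanical system: apply every requested signal value to a copy of the state.
    new_state = dict(state)
    for gamma in influences:
        new_state.update(gamma)
    return new_state

# Invented single-agent example; signal names and threshold are placeholders.
centre = Agent(
    percept=lambda sigma: {"pressure": sigma["pressure"]},
    decision=lambda phi, s: "contract" if phi["pressure"] > 10.0 else "relax",
    mem=lambda phi, s: "active" if phi["pressure"] > 10.0 else "idle",
    execute=lambda p, phi: {"detrusor": 1.0 if p == "contract" else 0.0},
)
new_state = step({"pressure": 12.0, "detrusor": 0.0}, [centre], merge_influences)
```

All agents read the same state before any internal state is updated, which mirrors the concurrent evaluation described above.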



3 The Biological System 

The previous section has set the general framework. This section will analyze a model 
of the neuronal regulator of the lower urinary tract using the proposed general 
framework. 

Fig. 2. Structure of the biological neuronal regulator. Afferent (A), efferent (E), 
voluntary (I) and internal signals. CD - cortical diencephalic, AP - preoptic area, 
PAG - periaqueductal grey area, PS - pontine storage, PM - pontine micturition, 
SM - sacral micturition, SS - sacral storage, DGC - dorsal grey commissure, 
TS - thoracolumbar storage 



3.1 Overview of the Biological System 

The lower urinary tract (LUT) is made up of a mechanical part, comprising the 
bladder and the urethra that allow urine to be stored and expelled from the body, and a 
neuronal part that controls these two functions. The complexity of the neuronal 
regulator of the LUT can easily be appreciated if we take into account that both reflex 
and voluntary mechanisms are involved in its functioning [9]. 

The biological neuronal regulator of the LUT is made up of neuronal centers and 
communicating paths. The latter connect the mechanical system and the neuronal 
centers [10]. Information streams from the mechanical system to the neuronal centers, 
where it is processed and forwarded to the mechanical system so as to contract or 
relax the muscles involved [8]. Fig. 2 shows the structure of the neuronal regulator 
of the lower urinary tract. 

One of the most important centers is the sacral micturition (SM) centre. It is a 
neuronal centre related to the biological function of regulating urine emptying [8]. 
Several inputs and outputs are identified in the SM [11]. Associated with the SM 
centre, two involuntary facilitator loops are identified at sacral level: the 
vesicoparasympathetic loop that is activated when the signal goes beyond the 
threshold and the uretrovesicalparasympathetic loop that is activated when the 
signal goes beyond the threshold. 





An Agent Based Framework for Modelling Neuronal Regulators 






3.2 Model of the Biological System 

On the basis of the formal framework proposed in section 2, we formally 
define the lower urinary tract (LUT) using the tuple: 

$$LUT = \langle MLUT, RLUT, \mathit{Interface} \rangle \tag{16}$$

in which MLUT models the mechanical part of the lower urinary tract, RLUT 
the neuronal regulator of the lower urinary tract, and Interface the relation 
between both parts. 

The neuronal regulator of the lower urinary tract consists of the set of neuronal 
centers (NC) that are constantly perceiving, deliberating and executing. 

The states of the system ($\sigma_i$) are lists of pairs composed of the afferent and 
efferent neuronal signals with their corresponding values. 

The influences of an agent ($\gamma_i$) are stated by a list of pairs of its efferent 
neuronal signals together with the new values that the agent wants to obtain. 

The tasks performed by an agent at a certain moment are associated with the 
influence the agent wants to exert on the system. 

On the other hand, bearing in mind that the neuronal centers are constantly 
registering information from the mechanical system and from other centers, and that 
they act as autonomous entities by means of efferent signals on the mechanical system 
and by means of internal neuronal signals on other centers, we model the centers as 
PDE agents. For example, the sacral micturition centre (SM) is defined by: 

$$SM = \langle \Phi_{SM}, S_{SM}, Percept_{SM}, Mem_{SM}, Decision_{SM}, Exec_{SM} \rangle \tag{17}$$

The perception function associated with the SM (Percept^^,) provides the group of 
signals of the state of the world whose origin or destination is this neuronal centre. 

The internal state of the SM is formed by the internal neuronal signals of origin or 
destination; that is, the input and the output neuronal signals. 

According to its current state and what it perceives, the centre will be able to
change to a new state and decide what action to take. The decision function
($Decision_{SM}$) presents a general structure that depends on its internal functions
($PreD_{SM}(\phi, s)$ and $FunD_{SM}(\phi, s)$) [11]. As is the case with the decision function, the
memorization function ($Mem_{SM}$) also presents a general structure dependent on its
internal functions ($PreM_{SM}(\phi, s)$ and $FunD_{SM}(\phi, s)$) [11]. These functions are defined in
Table 1.
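The perceive, memorize, decide and execute cycle described above can be sketched as follows. This is only an illustration under our own assumptions, not the simulator's code: the class name `PDEAgent`, the signal names, and the 0.5 threshold rule are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict

Signals = Dict[str, float]  # neuronal signal name -> current value


@dataclass
class PDEAgent:
    """Hypothetical perception-deliberation-execution agent."""
    name: str
    percept: Callable[[Signals], Signals]        # world state -> perception
    mem: Callable[[Signals, Signals], Signals]   # (perception, state) -> next state
    decision: Callable[[Signals, Signals], str]  # (perception, state) -> task
    execute: Callable[[str], Signals]            # task -> influences on efferent signals
    state: Signals = field(default_factory=dict)

    def step(self, world: Signals) -> Signals:
        phi = self.percept(world)               # keep only this centre's signals
        task = self.decision(phi, self.state)   # decide from perception and old state
        self.state = self.mem(phi, self.state)  # then memorize
        return self.execute(task)


# Toy instance: the signal names and the 0.5 threshold are made up.
sm = PDEAgent(
    name="SM",
    percept=lambda world: {k: v for k, v in world.items() if k.startswith("SM.")},
    mem=lambda phi, s: {**s, **phi},
    decision=lambda phi, s: "contract" if phi.get("SM.afferent", 0.0) > 0.5 else "rest",
    execute=lambda task: {"SM.efferent": 1.0 if task == "contract" else 0.0},
)

influences = sm.step({"SM.afferent": 0.9, "PM.afferent": 0.2})
```

The returned dictionary plays the role of the influences: pairs of efferent signals and the values the agent wants them to take.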



Table 1. Internal functions of decision and memorization. $\phi$ - perception; $s$ - internal state; ts()
- translation function; 1 - $PreD_{SM}(\phi, s)$; 2 - $FunD_{SM}(\phi, s)$; 3 - $PreM_{SM}(\phi, s)$; 4 - $FunD_{SM}(\phi, s)$



[Table 1 body not recoverable from the source. Surviving fragments indicate columns for
perception signals (φ), internal-state signals (s), ts(t), functions 1 to 4 and ts(t+1).]

By means of a translation function, ts(t), the internal states of the centre are
associated with the different segments of the vesical pressure curve, identified by the
different phases of the system [8] (I corresponds to an inactive state; MB, MM, MA
and M to a micturition state; R is a retention state), thus simplifying its current state.



4 Experiments 

To validate the model of the LUT, a simulator has been developed using Java as the 
programming language together with the support of a graphical representation tool. 
We have carried out different LUT simulations with data regarding both healthy 
individuals and those with dysfunctions due to neuronal causes. 



4.1 Situation without Dysfunctions 

The result of the tests in average working conditions, without dysfunctions, can be
observed in Fig. 3. In the storage phase, the centre remains inhibited, allowing the
bladder to be filled. During the emptying phase, the person activates the micturition 
centers to contract the detrusor to expel the urine. 

In the first part of the storage phase, we can observe that an increase in urine
generates exponential increases in pressure because of the initial stretching of the
muscle. When micturition begins, a contraction of the detrusor takes place,
generating a great increase in vesical pressure, which reaches values of 40 cm H₂O. At
the end of the process, the bladder will have practically emptied, maintaining a basal
pressure. The urine flows out when the external sphincter is opened. It can be seen
how the output flow of urine increases to 23 ml/s. As can be observed, the urodynamic
curves fall within the ranges accepted by the International Continence Society [12].



4.2 Situation with Neuronal Dysfunctions 

When a lesion affects the interaction between the sacral centers and the rest of
the neuronal centers (thoracolumbar centre, pontine centre and suprapontine centers),
that interaction ceases. The LUT no longer controls voluntary and involuntary areas, but
the vesicosomatic guarding reflex and the vesicosympathetic and urethral-parasympathetic
reflexes of micturition remain. A lesion of this type usually generates
detrusor-sphincter dyssynergia [13]. Fig. 4 shows the urodynamic curves of a suprasacral lesion,
in which detrusor-sphincter dyssynergia can be observed. When detrusor pressure is greater
than sphincter pressure, urine loss takes place.

The urodynamic data produced by the simulator resemble real clinical data,
particularly as far as their dynamics are concerned [12].

Fig. 3. Urodynamic data (vesical volume, vesical pressure, external sphincter pressure, urine
outflow) obtained by the simulator in a situation without dysfunctions




Fig. 4. Urodynamic data (vesical volume, vesical pressure, external sphincter pressure, urine
outflow) obtained by the simulator in suprasacral dysfunctions



5 Conclusions

In this paper an agent paradigm-based framework is presented. This paradigm, widely
used in other fields such as robotics or communications, shares interesting features that
cover the implicit requirements common to the majority of biological control models:
distribution, adaptability, emergence, etc.

The discussed framework delivers a model of the neuronal regulator of the lower 
urinary tract. The model presents an independent formulation of the present 
knowledge. Moreover, its modular conception makes it not only versatile but also 
capable of being enriched by further development in the field. 

A simulator has been used to validate the model. With it, urodynamic graphs
have been obtained and compared with real data obtained from
healthy individuals and others with simulated disorders. This simulator can be
used by specialists in research tasks, either to discover new information about the
mechanical and neuronal operation of the lower urinary tract or to complement existing
data. In hospitals, it can be of use as the nucleus of a monitoring and diagnostic aid tool.
In the field of education, it can be used to discuss pathological disorders connected to
the lower urinary tract.

Our most immediate task is to enlarge the model by improving the cognitive
capacity of the agents in order to create an artificial system capable of self-regulating
and of reducing the problems posed by the biological regulator. Eventually, the
outcome of this piece of research will hopefully be used as a basis for designing
a control device to be implanted in the human body.

Abstract. In cooperative problem solving, while some communication is necessary,
privacy issues can limit the amount of information transmitted. We study
this problem in the context of meeting scheduling. Agents propose meetings
consistent with their schedules while responding to other proposals by accepting or
rejecting them. The information in their responses is either a simple accept/reject
or an account of meetings in conflict with the proposal. The major mechanism of
inference involves an extension of CSP technology, which uses information about
possible values in an unknown CSP. Agents store such information within ‘views’
of other agents. We show that this kind of possibilistic information in combination
with arc consistency processing can speed up search under conditions of limited
communication. This entails an important privacy/efficiency tradeoff, in that this
form of reasoning requires a modicum of actual private information to be
maximally effective. If links between derived possibilistic information and events that
gave rise to these deductions are maintained, actual (meeting) information can be
deduced without any meetings being communicated. Such information can also
be used heuristically to find solutions before such discoveries can occur.



1 Introduction 

Constraint satisfaction is a powerful technology that has been successfully extended 
to distributed artificial intelligence problems [1]. Within the multi-agent setting new 
problems arise when agents have a degree of independence. Most systems of this sort have 
been built on the assumption that agents will be completely open about communicating 
information that might be relevant to solving a problem [2] [3]. This may not always be 
the case in such settings; agents may want to maintain their privacy as much as possible 
while still engaging in collaborative problem solving [4] . Since holding back information 
may impair the efficiency of problem solving, there is a potentially important tradeoff 
between privacy and efficiency that must be considered. 

In any ‘real-world’ situation, privacy issues are tied up with agent intentions (as
described, for example, in [5]). This means that any agent’s agenda (its goals or intentions)
potentially involves minimizing the loss of its own private information and/or gaining
information about other agents. Here, we assume both goals are in operation; the
question then is how to manage problem solving given these intentions. More specifically,
we study how search can be conducted under conditions of limited communication, as
well as the degree of privacy loss actually incurred under such conditions.

* This work was supported in part by NSF Grant No. IIS-9907385 and Nokia, Inc. and by Science
Foundation Ireland Grant No. 00/PI.1/C075.

R. Monroy et al. (Eds.): MICAI 2004, LNAI 2972, pp. 380-389, 2004.
(c) Springer-Verlag Berlin Heidelberg 2004

Possibilistic Reasoning and Privacy/Efficiency Tradeoffs in Multi-agent Systems

Because of privacy concerns, agents may need to operate under conditions of partial 
ignorance. In such cases, even though critical information may not be known, agents 
may be able to reason in terms of sets of possibilities, such as the set of possible values 
for a known variable. In this paper we show how this can be accomplished. This entails 
the use of a new formal structure that supports consistency reasoning under conditions of 
partial ignorance. The soundness of this approach is proven in detail elsewhere [6]. This 
paper demonstrates the effectiveness of this approach under conditions where agents 
would otherwise need to proceed more or less blindly. 

We demonstrate the efficacy of our methods by examining a simplified situation, 
which we think can be extended to more realistic scenarios. We use an independent 
agent paradigm, where agents communicate with each other to solve a problem of mutual 
interest. Rather than solving parts of a single problem, as in the distributed CSP paradigm 
[1], here each agent has its own problem to solve, but portions of the individual solutions 
must be mutually consistent. The specific application is a type of meeting-scheduling 
problem, where agents have pre-existing schedules and need to add a new meeting that 
all can attend. 

We first consider the kinds of information that agents can derive about other agents’
schedules in the course of a meeting scheduling session. We show that such information
can be encoded using ideas from standard modal logic, in which terms with modal
properties are treated as a kind of CSP value. We show that such information gathering
can be enhanced by consistency processing based on general temporal relations assumed
to hold for any schedule.

Regarding the privacy/efficiency tradeoff, we find that if agents reveal private
information (portions of their schedules) without being able to reason about it (via consistency
processing), there is no gain in efficiency, i.e. no tradeoff. If they can reason in this way,
then efficiency can be markedly improved provided that agents reveal small amounts of
private information; thus, under these conditions there is a clear tradeoff. We also show
that more sophisticated reasoning techniques can be used either to gain private
information under conditions of limited communication or to reduce search while revealing
very little information. This raises new issues in regard to privacy loss and tradeoffs with
efficiency.

Section 2 describes the basic problem for our agents. Section 3 introduces the idea of
“shadow CSPs” based on possibilistic information, which can represent an agent’s current
knowledge about another agent’s schedule. Section 4 describes a testbed and experiments
that test effects of different levels of communication and forms of knowledge (actual
and possible) on efficiency and privacy loss. Section 5 describes the basic experimental
results. Section 6 shows how knowledge of possibilities linked to communications that
gave rise to it can be used to deduce actual information about a schedule and to support
heuristics for gaining efficiency without such loss. Section 7 gives conclusions.

2 A Meeting Scheduling Problem

In the scheduling problem we are considering, each of k agents has its own calendar, 
consisting of appointments in different cities at different times of the week. The problem 
is to find a meeting time that all agents can attend given their existing schedules and 
constraints on travel time. Agents communicate on a 1:1 basis; the basic protocol is for 
one agent to suggest a meeting time in a certain city to each of the other agents, who then 
tell the first agent whether the choice is acceptable or not given their existing schedules. 

For analysis and experimentation, we devised a simplified form of this problem.
First, we assume a fixed set of cities where meetings can be held: London, Paris, Rome, 
Moscow and Tbilisi. We also restrict meeting times to be one hour in length and to 
start on the hour between 9 AM and 6 PM, inclusive, on any day of the week. These 
restrictions apply to pre-existing schedules as well as the new meeting assignment. 
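The size of the resulting proposal space can be checked with a short enumeration. This is a sketch under one assumption of ours: that "any day of the week" means all seven days. The names `CITIES`, `DAYS` and `START_HOURS` are illustrative only.

```python
from itertools import product

CITIES = ("London", "Paris", "Rome", "Moscow", "Tbilisi")
DAYS = ("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")  # assumption: 7 days
START_HOURS = range(9, 19)  # 9 AM .. 6 PM inclusive, on a 24-hour clock


def all_time_slots():
    """Every (day, start hour) pair; each slot is one CSP variable."""
    return list(product(DAYS, START_HOURS))


def all_proposals():
    """Every (city, day, start hour) triple that can be proposed."""
    return [(city, day, hour) for city in CITIES for day, hour in all_time_slots()]


slots = all_time_slots()
proposals = all_proposals()
print(len(slots), len(proposals))  # 70 350
```

Under these assumptions each schedule has 70 time slots, and each slot can hold one of five cities or be empty.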

The basic constraints are the times (in hours) required for travel between meetings 
in different cities, shown in Figure 1. Times between cities within one region (Western
Europe or the former Eastern Bloc) are shown beside arcs connecting cities; the arc 
between the two ellipses represents constraints between any two cities in the different 
regions. 
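A travel-time constraint of this kind can be sketched as a compatibility test between two meetings. The actual times of Figure 1 are not reproduced in the text, so every number below is a hypothetical placeholder; we also assume that a one-hour meeting starting at hour h ends at h + 1 and that only same-day meetings can conflict.

```python
REGION = {"London": "West", "Paris": "West", "Rome": "West",
          "Moscow": "East", "Tbilisi": "East"}

# Hypothetical travel times in hours (NOT the values of Figure 1).
WITHIN_REGION = {
    frozenset(("London", "Paris")): 1,
    frozenset(("London", "Rome")): 2,
    frozenset(("Paris", "Rome")): 2,
    frozenset(("Moscow", "Tbilisi")): 2,
}
BETWEEN_REGIONS = 3


def travel_hours(a, b):
    """Travel time between two cities under the placeholder values above."""
    if a == b:
        return 0
    if REGION[a] != REGION[b]:
        return BETWEEN_REGIONS
    return WITHIN_REGION[frozenset((a, b))]


def compatible(m1, m2):
    """True if one agent can attend both (city, day, hour) meetings."""
    (c1, d1, h1), (c2, d2, h2) = m1, m2
    if d1 != d2:
        return True           # assumption: overnight gaps always suffice
    if h1 == h2:
        return False          # one meeting per time slot
    free_hours = abs(h1 - h2) - 1  # hours between the end of one and the start of the other
    return free_hours >= travel_hours(c1, c2)
```

With these placeholder values, a meeting in Paris on Monday at 3 PM is incompatible with one in Rome on Monday at 5 PM, because only one free hour separates them.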




Fig. 1. Time constraint graph. Cities are London, Paris, Rome, Moscow and Tbilisi. 



3 Communication and Inference (About Other Agents’ Schedules) 

In the situation we are considering there are three basic kinds of message: proposals,
consisting of a city, day and hour (“Paris on Monday at 3 PM”), acceptances, and
rejections. In addition, under some experimental conditions agents give reasons for their
rejection by communicating one or all conflicts (“I have a meeting in Rome on Monday
at 5 PM.”). This allows agents to reveal small amounts of private information, which
has the potential to speed up search. This kind of communication can be considered as
a strategy that is tied specifically to privacy management and whose goal is to handle
any privacy/efficiency tradeoff. A similar but less focused strategy, that of exchanging
partial calendars, was used for this purpose in [4].

In the present situation, a solution to each agent’s problem depends on k - 1 other
problems that it does not know directly. However, information about these problems can
be represented in “views” of other agents’ schedules. A view can be updated after each
communication from another agent. Since an agent does not actually know the other
schedules, but it does know the basic constraints and meeting locations, it can deduce
facts about other schedules in terms of possible values at any point in the session.
This information can be used to guide selection of proposals, both deterministically (by
deducing that meetings are unavailable) and heuristically (by avoiding proposals that
conflict with possible existing meetings).

Deductions are made on the basis of the communications. From proposals and
acceptances, the agent that receives the message can deduce that the other agent has an
open time-slot. It can also deduce that certain meetings that might have been in the
other agent’s schedule are, in fact, not possible; otherwise, given the travel constraints,
the agent could not have proposed (or accepted) a given meeting in that slot. From a
rejection a simple reduction in the set of possibilities can be made that refers to the
meeting just proposed. In addition, the agent receiving this message can also deduce a
disjunctive set of possible causes for this rejection. Finally, if even a small number of
actual meetings are communicated, then many more possibilities can be excluded.

The approach we use to carry out these deductions combines CSP ideas with basic 
concepts from modal logic. In addition to gathering actual information about other 
agents’ schedules, agents keep track of possibilities regarding other agents’ meetings. 
This information is maintained in CSP-like representations, where time-slots are again 
taken as variables. In one type of CSP we have possible values for meetings that another 
agent may already have in its schedule, which we call “possible-has-meeting" values. 
In another, values represent meetings that an agent might be able to attend, termed 
“possible-can-meet" values. Considered more generally, the former represent possible 
existing assignments in an unknown CSP, while the latter represent possible future 
assignments in the same CSP. A third type represents possible causes for any rejection 
made by the other agent, termed “possible-cause" or “possible-conflict" values. 

To indicate the close semantic relation between these CSPs, which represent possible
values, and the actual CSP of the other agent, we call the former “shadow CSPs”.
Appropriately, shadow CSPs cannot be said to exist on their own, as can ordinary CSPs.
Moreover, they do not have solutions in the ordinary sense. However, they can be
described as a tuple, P = (V, D, C), consisting of variables V, domains D and constraints
C, just like ordinary CSPs.

The possibilistic character of domain values of shadow CSPs can be represented by
the possibility operator, ◇, from standard modal logic (cf. [7]). We, therefore, refer to
these values as “modal values”. In this paper, specific modal values are always referred
to in conjunction with the modal operator, e.g. ◇x. Inferences involving such values
are subject to the rules of modal logic as well as ordinary truth-functional logic. (The
connection to modal logic is described in more detail in [6].)

In this situation, we can make deductions from actual to possible values under a
closed world assumption, in that CSP domains can be considered closed worlds. In
this case, ¬x entails □¬x, where □ is the necessity operator. But, by a standard modal
equivalence, □¬x ≡ ¬◇¬¬x ≡ ¬◇x. Under this assumption, therefore, an agent can
make deductions from whatever ‘hard’ information it can glean during the scheduling
session in order to refine domains in the shadow CSPs.

For instance, when another agent makes or accepts a proposal, up to five
possible-has-meeting values can be deleted, since under closed world assumptions ¬x entails
¬◇x. Moreover, if x implies ¬y, then we can also deduce ¬◇y; hence, arc consistency
based on the original constraint graph can be used to delete possible-has-meeting values
for other variables (i.e. other cities in nearby time slots).
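The pruning step just described can be sketched as follows. This is an illustration under our own assumptions, not the testbed's implementation: `travel_hours` is a hypothetical stand-in for the travel times of Figure 1 (two hours between any two distinct cities), and a view is modelled as a dictionary from time slots to sets of still-possible cities.

```python
CITIES = ("London", "Paris", "Rome", "Moscow", "Tbilisi")
HOURS = range(9, 19)  # one-hour meetings starting 9 AM .. 6 PM


def travel_hours(a, b):
    # Hypothetical stand-in for the travel times of Figure 1.
    return 0 if a == b else 2


def new_view(days):
    """Possible-has-meeting domains: every city is possible in every slot."""
    return {(d, h): set(CITIES) for d in days for h in HOURS}


def prune_after_open_slot(view, proposal):
    """The other agent proposed or accepted `proposal`, so that slot is open.

    Remove every possible-has-meeting value that could not coexist with it.
    Returns the number of values (nogoods) removed.
    """
    city, day, hour = proposal
    removed = 0
    for (d, h), domain in view.items():
        if d != day:
            continue
        for c in sorted(domain):  # sorted() copies, so discarding is safe
            if h == hour:
                impossible = True  # the slot itself holds no meeting
            else:
                free_hours = abs(h - hour) - 1
                impossible = free_hours < travel_hours(c, city)
            if impossible:
                domain.discard(c)
                removed += 1
    return removed


view = new_view(["Mon"])
removed = prune_after_open_slot(view, ("Paris", "Mon", 15))
```

In this toy setting a single proposal empties the domain of the proposed slot and leaves only Paris in the two neighbouring slots on either side.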

For possible has-meetings, inferences can also be made back to the actual values.
This is because ¬◇x implies ¬x under standard assumptions. If, for all cities x associated
with a single time-slot, we have inferred ¬◇x, we can then infer that the agent has no
meeting at that time, i.e. it has an open slot.
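The two directions of inference used here can be written out step by step. The last line assumes the reflexivity axiom T (x → ◇x), which is our reading of "standard assumptions"; the source does not name the axiom.

```latex
\begin{align*}
\neg x \;&\Rightarrow\; \Box\neg x
  && \text{(closed world assumption)} \\
\Box\neg x \;&\equiv\; \neg\Diamond\neg\neg x \;\equiv\; \neg\Diamond x
  && \text{(duality of $\Box$ and $\Diamond$, double negation)} \\
\neg\Diamond x \;&\Rightarrow\; \neg x
  && \text{(contrapositive of $x \rightarrow \Diamond x$, axiom T)}
\end{align*}
```

The first two lines justify deleting a modal value once a hard fact is known; the third justifies concluding that a slot is open once every modal value for it has been deleted.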

From a simple rejection, an agent can only infer that one possible-can-meet value
is invalid. However, if an existing meeting is given as a reason, it is possible to remove
up to five possible-can-meet and four possible-has-meeting values from that time slot,
and, using arc consistency reasoning, to delete other possible-can-meet and
possible-has-meeting values based on the known constraints between hard values.

From a rejection, an agent can also use arc consistency to deduce possible-cause 
values, i.e. it can deduce the set of possible causes for that rejection. Unfortunately, 
the set of possible causes for a rejection forms a disjunctive relation, in contrast to sets 
of inferred possible-has-meeting and possible-can-meet values, which are conjunctive. 
However, since possible-cause values must be included in the set of possible-has-meeting 
values, this subset relation allows agents to prune values of the former kind, so that 
knowledge about another agent’s schedule can be refined. 
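The derivation and later pruning of possible-cause values can be sketched as below. The helper names, the view layout (time slot mapped to a set of still-possible cities) and the uniform two-hour travel time are all assumptions made for illustration.

```python
def travel_hours(a, b):
    # Hypothetical stand-in for the travel times of Figure 1.
    return 0 if a == b else 2


def conflicts(m1, m2):
    """True if one agent cannot attend both (city, day, hour) meetings."""
    (c1, d1, h1), (c2, d2, h2) = m1, m2
    if d1 != d2:
        return False
    if h1 == h2:
        return True
    return abs(h1 - h2) - 1 < travel_hours(c1, c2)


def possible_causes(rejected, has_meeting_view):
    """Disjunctive set: any still-possible meeting that conflicts with the rejected proposal."""
    return {(c, d, h)
            for (d, h), domain in has_meeting_view.items()
            for c in domain
            if conflicts((c, d, h), rejected)}


def prune_causes(causes, has_meeting_view):
    """Keep only causes that are still possible-has-meeting values (subset relation)."""
    return {(c, d, h) for (c, d, h) in causes
            if c in has_meeting_view.get((d, h), set())}


view = {("Mon", 14): {"Rome", "Paris"},
        ("Mon", 15): {"London"},
        ("Mon", 18): {"Moscow"}}
causes = possible_causes(("Paris", "Mon", 15), view)

# Later the other agent reveals an open slot on Monday at 3 PM.
view[("Mon", 15)].discard("London")
pruned = prune_causes(causes, view)
```

Because the causes form a disjunction, no single value can be asserted until the set shrinks; the subset relation is what allows it to shrink.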




Fig. 2. Structure of shadow CSP system for the meeting-scheduling problem. Arrows represent set 
inclusion (or “realization") relations holding between domains of super- and subordinate CSPs. 



To show that this deductive machinery is valid, we establish a system composed of
the actual and shadow CSPs, establish certain requirements for the system to be
well-structured, and show that these requirements are satisfied at every stage of the search
process. We add to the present set of CSPs a supremum whose domains contain the set
of all possible domain values and an infimum which in this case is composed of null sets.
Then, the corresponding domains of these CSPs can in each case be arranged in a partial
order under the relation of set inclusion (see Figure 2). (Corresponding domains are those
associated with the same variable.) The proof that the system remains well-structured
shows that, with the present communications and rules of inference, this relation remains
achievable at every step in search [6]. In particular, this means that we cannot deduce
an actual value that is not (potentially) contained in the corresponding domain of all
superordinate shadow CSPs. In other respects, the system is sound because it relies on
standard logic, plus the closed world assumption.

At the beginning of a session when an agent knows nothing about other agents’ 
schedules, all domains of the possible-has-meeting and possible-can-meet CSPs, as 
well as the “universal" shadow CSP, have five modal values, corresponding to the five 
possible cities. The possible-cause shadow and the actual CSP have empty domains. As 
search proceeds, values are deleted from the domains of the first two CSPs and added 
to the last two. The soundness of the deduction rules ensures that the system remains
potentially well-structured throughout search.

4 An Experimental Testbed 

4.1 System Description 

Although the previous section and the work referred to demonstrate the soundness of
our deductive system, we still need to know how well it will perform in practice. Will
reasoning based on possibilities improve efficiency? Under what conditions and to what 
degree? And how will it affect the privacy/efficiency tradeoff? 

To study these issues, a testbed system was built in Java. The system allows the user 
to select the number of agents and initial meetings. In addition, the user can select the: 

• level of communication (xor): (i) a ‘minimum’ level consisting of the three basic
messages, propose, accept and reject, (ii) a level in which each rejection is
accompanied by one reason, i.e. one meeting that conflicts with the proposal, (iii) a level
in which all such reasons are given.

• knowledge to be gathered about other agents (ior): actual knowledge (meetings 
and open slots), possible-can-meet nogoods, possible-has-meeting nogoods, and 
possible-cause values. 

• optional use of arc consistency processing (as described above). 

• proposal strategy (xor): (i) blind guessing, where agents choose any time slot allowed
by their own schedules without remembering earlier proposals, (ii) guessing where previous,
rejected proposals (by any agent) are avoided, (iii) proposals are guided by
accumulated knowledge, of whatever kind chosen to be gathered, (iv) proposals are also
guided by heuristics based on this knowledge.

• protocol (xor): (i) round robin, where each agent makes a proposal in turn, (ii) “one 
coordinator", where all proposals are made by one agent. 

4.2 Design of Experiments 

Most experiments reported here involve three agents, where the number of initial
meetings varies from 5 to 40 in steps of 5. In each experiment, an individual test run begins
with random generation of schedules followed by a series of proposals which
continue until one is found that is acceptable to all agents. The present experiments use the
“round-robin” protocol, which is democratic and allows agents to update their knowledge
efficiently. A similar protocol was used in [4].

At each step of an experimental run, candidate proposals are generated for one agent
by choosing a time slot and city at random to ensure unbiased sampling. This is repeated
until the candidate fits the proposer’s schedule. Then, depending on the experimental
settings, further tests may be made against this agent’s knowledge, for example, knowledge
of other agents’ actual meetings, current possible-has-meeting’s, etc. The first proposal
that passes all these tests is communicated to all the other agents, and the latter reply
with an acceptance or rejection (with or without reasons) to the proposer alone.
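The candidate-generation step described above can be sketched as a rejection-sampling loop. The function name `next_proposal`, the `knowledge_tests` list of predicates, the `max_tries` cap and the uniform two-hour travel time are all hypothetical; the testbed's own Java code is not given in the text.

```python
import random

CITIES = ("London", "Paris", "Rome", "Moscow", "Tbilisi")
DAYS = ("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")
HOURS = tuple(range(9, 19))  # start hours, 9 AM .. 6 PM


def travel_hours(a, b):
    # Hypothetical stand-in for the travel times of Figure 1.
    return 0 if a == b else 2


def conflicts(m1, m2):
    """True if one agent cannot attend both (city, day, hour) meetings."""
    (c1, d1, h1), (c2, d2, h2) = m1, m2
    if d1 != d2:
        return False
    if h1 == h2:
        return True
    return abs(h1 - h2) - 1 < travel_hours(c1, c2)


def next_proposal(schedule, knowledge_tests, rng, max_tries=100_000):
    """Sample candidates uniformly until one fits the proposer's own
    schedule and passes every knowledge-based test."""
    for _ in range(max_tries):
        candidate = (rng.choice(CITIES), rng.choice(DAYS), rng.choice(HOURS))
        if any(conflicts(candidate, m) for m in schedule):
            continue  # does not fit the proposer's schedule
        if all(test(candidate) for test in knowledge_tests):
            return candidate
    return None  # no acceptable candidate found within the cap


schedule = [("Rome", "Mon", 17)]
nogoods = {("Paris", "Tue", 9)}  # e.g. a possible-can-meet nogood
proposal = next_proposal(schedule, [lambda c: c not in nogoods], random.Random(0))
```

Passing the knowledge tests as a list of predicates mirrors the way the experimental settings switch different kinds of knowledge on or off.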

During a run, modal values are deleted from the possible-has-meeting and
possible-can-meet shadow CSPs; in the implementation they are stored as nogoods. Nogoods
are generated either by direct inference or through arc consistency processing after a
message has been received, as described in Section 3.

In these experiments, the efficiency measure is the number of proposals per run, averaged
over all 500 runs of an experiment. The measures of privacy lost are the number of meetings
identified, the number of open slots identified, and the number of modal values removed from
the shadow CSPs for possible-can-meet’s and possible-has-meeting’s. Privacy tallies are
averaged per agent view per run. (There are two views per pair of agents, or n(n - 1)
views for n agents.) In addition, the number of solutions per agent was determined, as well
as the number of common solutions. Differences in means are evaluated statistically,
using t-tests or analyses of variance.

5 Efficiency and Privacy: Empirical Results 

Figure 3 shows measures of efficiency and privacy loss for the “baseline” condition of
minimal communication, where no knowledge is used except past proposals. (For later
comparisons, actual and possibilistic knowledge was collected but not used to guide
proposals.) The curvilinear relation between the number of initial meetings and the number of
proposals required to find a solution is due to the relation between the number of common
solutions and the average number of personal solutions, which decreased at different rates.
Note the large number of possible has-meeting’s discarded in relation to the number of
can-meet’s.

Effects of varying conditions of communication and inference are shown in Figure 
4 for 15 initial meetings. The same pattern of results was found with other numbers of 
initial meetings. The main conclusions from these experiments are as follows (statistical 
comparisons are with the corresponding baseline condition): 

• explicit communication about meetings does not improve efficiency, despite the
giving up of ‘hard’ information (“1” or “all conflicts”, t(499) < 1.24, ns).

• efficiency is not improved by deriving modal information even when this is enhanced
by arc consistency reasoning (“knowledge” and “know+AC”, t(499) < .36, ns),
which allows deduction of many more possible-has-meeting nogoods.

• a combination of explicit information interchanged, information in the form of modal
values, and arc consistency processing results in a marked improvement in efficiency
(“know+AC-1” and “-all”, t(499) > 8.59, p ≪ .01). In this case, large numbers
of possible-can-meet nogoods are deduced.

Fig. 3. Efficiency and privacy measures in ‘baseline’ experiments, where no explicit meeting
information is communicated and no information is used except previous proposals.

Fig. 4. Efficiency and loss of ‘hard’ information concerning meetings and open slots under
varying conditions with respect to level of communication, knowledge, and consistency reasoning.
Experiments with three agents and 15 initial meetings.



6 Linking Possibilities to Causes (Specific Communications) 

If derived information is classified by the kind of event that gave rise to it, this can affect
both privacy loss and efficiency. For example, if possible-has-meeting nogoods deduced
from conflicts are distinguished from those deduced from proposals and acceptances,
then open slots deduced from the former are times when an agent cannot meet.
Subsequent proposals should, therefore, not involve those time-slots. This extension (and the
next) can be easily incorporated into the shadow CSP system in a way that maintains its
soundness [6].

Another form of deduction based on a similar strategy involves possible-cause values.
If the relation between possible-cause values and the rejection that gave rise to them is
retained, then values associated with a specific rejection can be pruned. If such a set is
reduced to one value, it can be concluded that this meeting must be in the schedule of
the agent that made the rejection. Thus, it is possible to deduce a meeting of another
agent even when no explicit meeting information is exchanged. In empirical tests, agents
using this strategy discovered a number of actual meetings. For 15-25 initial meetings,
the means were > .1 per agent view, while for the longest runs up to 2-3 meetings
were found per view. Efficiency was not improved, however, because of the number of
proposals required to gain such information.
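The singleton deduction described above can be sketched in a few lines. The data layout (a dictionary from each rejected proposal to its set of possible causes) and the function name `deduce_meetings` are assumptions made for illustration.

```python
def deduce_meetings(causes_by_rejection, has_meeting_nogoods):
    """Prune each rejection's possible-cause set against known nogoods.

    A set reduced to a single value identifies an actual meeting of the
    agent that made the rejection. The sets are pruned in place.
    """
    deduced = set()
    for causes in causes_by_rejection.values():
        causes.difference_update(has_meeting_nogoods)
        if len(causes) == 1:
            deduced |= causes
    return deduced


# Hypothetical example data: (city, day, start hour) triples.
causes_by_rejection = {
    ("Paris", "Mon", 15): {("Rome", "Mon", 14), ("London", "Mon", 15)},
    ("Rome", "Tue", 10): {("Paris", "Tue", 9), ("Moscow", "Tue", 11)},
}
nogoods = {("London", "Mon", 15)}  # deduced to be impossible since the rejection
deduced = deduce_meetings(causes_by_rejection, nogoods)
```

Keeping the causes keyed by rejection is what makes the deduction possible: a pooled set of causes would never shrink to a single explanation for any one rejection.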

Possible-cause values stored in this manner can also be used to heuristically guide 
proposal selection. Given a candidate proposal, if any possible-cause values associated 
with the nearest rejections before or after would also conflict with the new proposal, 
then the latter is (temporarily) avoided. 
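One possible reading of this heuristic is sketched below. We assume that "nearest" means nearest in time within the week, and we reuse a hypothetical uniform two-hour travel time; neither detail is specified in the text.

```python
DAYS = ("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")


def travel_hours(a, b):
    # Hypothetical stand-in for the travel times of Figure 1.
    return 0 if a == b else 2


def conflicts(m1, m2):
    """True if one agent cannot attend both (city, day, hour) meetings."""
    (c1, d1, h1), (c2, d2, h2) = m1, m2
    if d1 != d2:
        return False
    if h1 == h2:
        return True
    return abs(h1 - h2) - 1 < travel_hours(c1, c2)


def slot_index(meeting):
    """Position of a (city, day, hour) meeting on a weekly hour line."""
    _, day, hour = meeting
    return DAYS.index(day) * 24 + hour


def should_avoid(candidate, causes_by_rejection):
    """Avoid a candidate if a possible cause of the nearest earlier or
    later rejection would also conflict with it."""
    t = slot_index(candidate)
    earlier = [r for r in causes_by_rejection if slot_index(r) < t]
    later = [r for r in causes_by_rejection if slot_index(r) > t]
    nearest = []
    if earlier:
        nearest.append(max(earlier, key=slot_index))
    if later:
        nearest.append(min(later, key=slot_index))
    return any(conflicts(cause, candidate)
               for r in nearest
               for cause in causes_by_rejection[r])


causes_by_rejection = {
    ("Paris", "Mon", 15): {("Rome", "Mon", 14), ("London", "Mon", 15)},
}
avoid_late = should_avoid(("Paris", "Mon", 16), causes_by_rejection)
avoid_early = should_avoid(("Paris", "Mon", 9), causes_by_rejection)
```

Here the 4 PM candidate is avoided because the possible Rome meeting at 2 PM would also block it, while the 9 AM candidate is far enough from every possible cause to be tried.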




Fig. 5. Performance with and without heuristic based on possible causes for problems with 20 
agents having 7 initial meetings. Log survivorship curves show number of unfinished runs after k 
proposals. Each curve is based on a total of 500 runs. 



This heuristic does improve efficiency, without actual meetings being deduced, e.g.
for 15 initial meetings, there was an average improvement of 20% (t(499) = 3.27,
p < .01). More importantly, the range of values for number of proposals across all
runs is drastically reduced. This occurs because the precision of the heuristic improves
dramatically as possible-cause lists are reduced. This is shown in Figure 5 for a much
harder problem involving 20 agents and seven initial meetings. In this experiment the
maximum number of proposals was 275 without and 188 with the heuristic. In addition,
total runtime was reduced by a factor of 2-3.

7 Conclusions

Work on modeling agents has considered models that incorporate ideas from modal 
logic. However, these models have been concerned with representing agent cognition in 
order to demonstrate consistency between beliefs and goals and similar issues [8]. There 
seems to be little work done on agent views that corresponds to the present research. 

Perhaps the most significant aspect of this work is that we have shown how
consistency reasoning can allow agents to find solutions more efficiently under conditions
of ignorance. In doing so, we have developed a new CSP formalism that makes the
possibilistic aspects of such reasoning more explicit and more coherent.

Significant improvements in efficiency were obtained either by giving up a limited
amount of private information in communications or by linking modal information to
specific causes (here, communications), which could in turn be associated with specific
CSP elements in the form of candidate proposals. In both cases, possibilistic information
was essential. In the former case, the critical feature was the deduction of
possible-can-meet nogoods; in the latter case, it was a combination of possible-cause and
possible-has-meeting values.

In the present situation we discovered some ways to improve efficiency, and in so
doing we also established some parameters for privacy/efficiency tradeoffs. Most
importantly, we found that such tradeoffs do not emerge unless communication is
accompanied by effective reasoning. Under these conditions, the tradeoff can be modulated
by variations in communication and even to a degree finessed by sophisticated heuristic
techniques.

Abstract. Conventional artificial neural network models lack many
physiological properties of the neuron. Current learning algorithms are more
concerned with computational performance than with biological credibility.
For a natural language processing application, thematic role
assignment (the semantic relations between words in a sentence), the purpose of
the proposed system is to compare two different connectionist modules for the
same application: (1) the usual simple recurrent network using the backpropagation
learning algorithm and (2) a biologically inspired module, which employs a
bi-directional architecture and a learning algorithm better suited to the physiological
attributes of the cerebral cortex. Identical sets of sentences are used to train the
modules. After training, the output data show that the physiologically
plausible module displays higher accuracy for expectable thematic roles than
the traditional one.



1 Introduction 

Many connectionist natural language processing systems employ recurrent
architectures instead of feedforward networks. These systems with “reentrancy” are
expected to be better suited to the temporal extension of natural language
sentences and, at the same time, seem to be physiologically more realistic [1].
Other biological features are being taken into account in order to achieve new models
that restore the original concerns of artificial neural systems. Connectionist models
based on neuroscience are about to be considered the next generation of artificial
neural networks, inasmuch as current models are far from biology, mainly for
reasons of mathematical simplicity [2].

This paper compares two distinct connectionist modules of a system for
thematic role assignment in natural language sentences: a conventional simple
recurrent network employing the backpropagation learning algorithm (TRP-BP) and
a bi-directional architecture using a biologically plausible learning algorithm, adapted
from the Generalized Recirculation algorithm [3] (TRP-GR). Using the same set of
test sentences, it is shown that, for the same training set, the neurophysiologically
motivated module better reflects the thematic relationships taught to the system.
2 Thematic Roles

Linguistic theory [4] refers to the roles words usually have in relation to the predicate 
(often the verb) as thematic roles, so that the verb break, for instance, in one possible 
reading of sentence (1), assigns the thematic roles agent, patient, and instrument, 
because the subject man is supposed to be deliberately responsible for the action of 
breaking (the “agent”), the object vase is the “patient” affected by the action, and the 
complement stone is the “instrument” used for such action. 

The man broke the vase with the stone . (1) 

But the thematic structure can change for some verbs. So, in sentence (2), there is a 
different thematic grid ([CAUSE, PATIENT]) assigned by the same verb break, since the 
subject ball causes the breaking, but in an involuntary way. 

The ball broke the vase . (2) 

Verbs that present two or more thematic grids, depending on the sentence in which
they occur, like the verb break, are called here thematically ambiguous verbs. In a
componential perspective, it is possible to have a representation for verbs
independently of the sentence in which they occur. Considering sentences (1) and (2)
again, it seems that the nouns employed as subjects make the distinction between
AGENT and CAUSE. In other words, thematic roles must be elements with semantic
content [5].

One reason for choosing thematic role assignment as a connectionist
natural language processing application is its componential nature. Details
can be found in [6].



2.1 Word Representation 

In the system presented, word representation is adapted from the classical distributed 
semantic microfeature representation [7]. Twenty three-valued logic semantic 
microfeature units account for each verb and noun. Table 1 and table 2 display the 
semantic features for verbs and nouns, respectively. See also the microfeatures for 
two different readings of the thematically ambiguous verb break in table 3 [8].

It is important to notice here that the microfeatures for verbs are chosen to
cover the semantic issues considered relevant in a thematic frame. The
microfeatures outside this context are not meaningful. They only make sense in a 
system where the specification of semantic relationships between the words in a 
sentence plays a leading role [6]. 



3 TRP-BP



The Thematic Role Processor (TRP) is a connectionist system designed to process the
thematic roles of natural language sentences, based on its symbolic-connectionist
hybrid version [9]. For each input sentence, TRP outputs its thematic grid.







J.L. Garcia Rosa 



Table 1. The ten semantic microfeature dimensions for verbs. For thematically unambiguous 
verbs, only one feature in each dimension is on 



“positive” feature            “negative” feature
control of action             no control of action
direct process triggering     indirect process triggering
direction to source           direction to goal
impacting process             no impacting process
change of state               no change of state
psychological state           no psychological state
objective                     no objective
effective action              no effective action
high intensity of action      low intensity of action
interest on process           no interest on process



Table 2. The seven semantic microfeature dimensions for nouns, separated in rows. Only one 
value in each dimension is on for each unambiguous noun (adapted from [7]) 



human               non-human
soft                hard
small               medium        large
1-D/compact         2-D           3-D
pointed             rounded
fragile/breakable   unbreakable
value               furniture     food     toy     tool/utensil     animate



Table 3. The semantic microfeatures for the thematically ambiguous verb break, with the
default reading and two alternative readings (break1 and break2). The “?” sign represents
ambiguity [8]

microfeature           break    break1    break2
control of action      ?        yes       no
process triggering     ?        direct    indirect
direction              goal     goal      goal
impacting process      yes      yes       yes
change of state        yes      yes       yes
psychological state    no       no        no
objective              ?        yes       no
effective action       yes      yes       yes
intensity of action    high     high      high
interest on process    ?        yes       no
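The readings of break in Table 3 can be pictured as twenty-unit input vectors. The sketch below is only an illustration: it assumes two units per dimension (one for the “positive” and one for the “negative” pole of Table 1) and codes the ambiguous “?” value as 0.5 on both units. The source states only that the units are three-valued, so the numeric coding and the names encode_verb, POSITIVE, and NEGATIVE are hypothetical.

```python
# Hypothetical encoding of the verb readings of Table 3 as 20-unit vectors.
# Assumption: each of the ten dimensions uses two units (positive pole,
# negative pole); an ambiguous "?" value activates both units at 0.5.

DIMENSIONS = [
    "control of action", "process triggering", "direction",
    "impacting process", "change of state", "psychological state",
    "objective", "effective action", "intensity of action",
    "interest on process",
]

# Values selecting the "positive" pole of a dimension (Table 1, left column).
POSITIVE = {"yes", "direct", "source", "high"}
# Values selecting the "negative" pole (Table 1, right column).
NEGATIVE = {"no", "indirect", "goal", "low"}

BREAK_READINGS = {
    "break":  ["?", "?", "goal", "yes", "yes", "no", "?", "yes", "high", "?"],
    "break1": ["yes", "direct", "goal", "yes", "yes", "no", "yes", "yes", "high", "yes"],
    "break2": ["no", "indirect", "goal", "yes", "yes", "no", "no", "yes", "high", "no"],
}


def encode_verb(values):
    """Return a flat list of 20 activations, two per dimension."""
    if len(values) != len(DIMENSIONS):
        raise ValueError("expected one value per dimension")
    vector = []
    for value in values:
        if value in POSITIVE:
            vector += [1.0, 0.0]
        elif value in NEGATIVE:
            vector += [0.0, 1.0]
        elif value == "?":
            vector += [0.5, 0.5]
        else:
            raise ValueError(f"unknown microfeature value: {value!r}")
    return vector


default_break = encode_verb(BREAK_READINGS["break"])
agentive_break = encode_verb(BREAK_READINGS["break1"])
```

Under this coding, the default reading differs from break1 and break2 only in the four dimensions marked “?”, which is where the subject noun would have to resolve the choice between AGENT and CAUSE.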



TRP is deployed in two modules with completely different approaches: BP and GR. 
TRP-BP learns through the backpropagation algorithm and employs an architecture
representing a four-layer simple recurrent neural network with forty input units (A), 
fifteen hidden units (B), fifteen context units (D), and ten output units (C), one for 
each of the ten thematic roles: AGENT, PATIENT, EXPERIENCER, THEME, SOURCE, 
GOAL, BENEFICIARY, CAUSE, INSTRUMENT, and VALUE (figure 1). 
Fig. 1. The four-layer simple recurrent connectionist architecture of TRP-BP. At the input layer
A, the words, represented by their distributed semantic microfeatures, are entered sequentially at
their specific slots according to their syntactic category: verb or noun (subject, object, or
complement). At the output layer C, a thematic role (AGENT, PATIENT, EXPERIENCER, THEME,
SOURCE, GOAL, BENEFICIARY, CAUSE, INSTRUMENT, or VALUE) is displayed as soon as a word is
entered in layer A. The context layer D represents the memory of the network, to which the
hidden layer B is copied after each training step [10]

The input layer is divided into a twenty-unit slot for the verb and another twenty-unit
slot for nouns. Words are presented in terms of their semantic microfeatures, one at a
time, at their specific slots, until the whole sentence has been entered. In this way,
besides the semantics included in the distributed representation, languages with
any word order (verb-subject-object, VSO, as well as SVO) could be handled,
since a predicate-arguments relation is established.
At output layer C, thematic roles are highlighted as soon as they are assigned. For
instance, when the subject of a sentence is presented, no thematic role shows up,
because the main verb, the predicate that assigns the role, is still unknown.
When the verb appears, the network immediately displays the thematic role
assigned to the subject presented previously. For the other words, the corresponding
thematic roles are displayed at the output, one at a time, for every input word.
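The forward pass of a simple recurrent network with the layer sizes given above can be sketched as follows. This is a minimal illustration, not the author's implementation: the weight initialisation, the class name SimpleRecurrentNet, the absence of bias units, and the way the caller fills the two input slots are assumptions; training by backpropagation is omitted.

```python
import math
import random

N_INPUT, N_HIDDEN, N_OUTPUT = 40, 15, 10   # layer sizes given in the text
ROLES = ["AGENT", "PATIENT", "EXPERIENCER", "THEME", "SOURCE",
         "GOAL", "BENEFICIARY", "CAUSE", "INSTRUMENT", "VALUE"]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def random_matrix(rows, cols, rng, scale=0.1):
    """Small random weights; the initialisation scheme is an assumption."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)]
            for _ in range(rows)]


class SimpleRecurrentNet:
    """Forward pass only: layers A (input), B (hidden), D (context), C (output)."""

    def __init__(self, seed=0):
        rng = random.Random(seed)
        self.w_ab = random_matrix(N_HIDDEN, N_INPUT, rng)    # A -> B
        self.w_db = random_matrix(N_HIDDEN, N_HIDDEN, rng)   # D -> B
        self.w_bc = random_matrix(N_OUTPUT, N_HIDDEN, rng)   # B -> C
        self.context = [0.0] * N_HIDDEN                      # layer D

    def step(self, verb_slot, noun_slot):
        """Present one word (20-unit verb slot + 20-unit noun slot)."""
        x = list(verb_slot) + list(noun_slot)
        if len(x) != N_INPUT:
            raise ValueError("verb and noun slots must total 40 units")
        hidden = [
            sigmoid(sum(w * v for w, v in zip(self.w_ab[j], x))
                    + sum(w * c for w, c in zip(self.w_db[j], self.context)))
            for j in range(N_HIDDEN)
        ]
        output = [sigmoid(sum(w * h for w, h in zip(self.w_bc[k], hidden)))
                  for k in range(N_OUTPUT)]
        self.context = hidden   # hidden layer B is copied to context layer D
        return output


net = SimpleRecurrentNet()
empty = [0.0] * 20
noun = [1.0, 0.0] * 10          # placeholder microfeature pattern
out1 = net.step(empty, noun)    # first word: context is still empty
out2 = net.step(empty, noun)    # same input, but context now holds memory
```

Presenting the same pattern twice gives two different output vectors, because the second presentation also receives the copied hidden activations through layer D.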



3.1 The Biological Implausibility of Backpropagation 

The backpropagation algorithm is widely employed nowadays as the most
computationally efficient connectionist supervised learning algorithm. But
backpropagation is argued to be biologically implausible [11]. The reason is that it is
based on error back-propagation: while the stimulus propagates forward,
the error (the difference between the actual and the desired outputs) propagates
backward. In the cerebral cortex, it seems that the stimulus generated when a
neuron fires travels along the axon towards its end in order to make a synapse onto the
input of another neuron (called a dendrite). If backpropagation occurred in the brain,
the error would have to propagate back from the dendrite of the post-synaptic neuron to
the axon and then to the dendrite of the pre-synaptic neuron. This sounds unrealistic and
improbable. Researchers believe that the synaptic “weights” have to be modified in
order to make learning possible, but certainly not in this way. It is expected that the
weight change uses only local information available at the synapse where it occurs.
That is why backpropagation seems so biologically implausible [8].



4 TRP-GR

The module TRP-GR (GR for Generalized Recirculation) consists of a bi-directional 
connectionist architecture, with three layers (A units in input layer, B units in hidden 
layer, and C units in output layer) and lateral inhibition occurring at the output level 
(figure 2). The input and the output operate in the same way as in TRP-BP.



4.1 The Learning Procedure 

The learning procedure of TRP-GR, also employed in [12] and [8], is inspired by the
Recirculation [13] and GeneRec [3] algorithms, and uses the notion of two phases
(minus and plus). First, the inputs $x_i$ are presented to the input layer. In the
minus phase, these stimuli are propagated to the output through the hidden
layer (bottom-up propagation). The previous actual output $o_k(t-1)$ is also
propagated back to the hidden layer (top-down propagation). Then the hidden
minus activation $h_j^-$ is generated by passing the sum of the bottom-up and
top-down propagations through the sigmoid activation function, represented by
$\sigma$ in equation 3. Finally, the current actual output $o_k(t)$ is generated by
propagating the hidden minus activation to the output layer (equation 4). The
indexes i, j, and k refer to input, hidden, and output units, respectively.

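$$h_j^- = \sigma\left(\sum_{i=0}^{A} w_{ij}\,x_i + \sum_{k=1}^{C} w_{jk}\,o_k(t-1)\right) \qquad (3)$$

$$o_k(t) = \sigma\left(\sum_{j=1}^{B} w_{jk}\,h_j^-\right) \qquad (4)$$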

In the plus phase, there is a propagation from the input to the hidden layer
(bottom-up). After this, the desired output $y_k$ is propagated to the hidden
layer (top-down). Then the hidden plus activation $h_j^+$ is generated by summing
these two propagations (equation 5).

$$h_j^+ = \sigma\left(\sum_{i=0}^{A} w_{ij}\,x_i + \sum_{k=1}^{C} w_{jk}\,y_k\right) \qquad (5)$$

In order to make learning possible, the synaptic weights $w$ are updated, based on $x_i$,
$h_j^-$, $h_j^+$, $o_k(t)$, and $y_k$, in the way shown in equations 6 and 7. Notice the
presence of the learning rate ($\eta$), considered an important variable during the
experiments [14].
Fig. 2. The three-layer bi-directional connectionist architecture of TRP-GR. At the input layer
A, the words, represented by their distributed semantic microfeatures, are entered sequentially at
their specific slots according to their syntactic category: verb or noun (subject, object, or
complement). At the output layer C, a thematic role (AGENT, PATIENT, EXPERIENCER, THEME,
SOURCE, GOAL, BENEFICIARY, CAUSE, INSTRUMENT, or VALUE) is displayed as soon as a word is
entered in layer A. This architecture is similar to that of TRP-BP (figure 1), except that there
is no layer D and the connections between layers B and C are bi-directional



$$\Delta w_{jk} = \eta\,\bigl(y_k - o_k(t)\bigr)\,h_j^- \qquad (6)$$

$$\Delta w_{ij} = \eta\,\bigl(h_j^+ - h_j^-\bigr)\,x_i \qquad (7)$$
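The two-phase procedure of equations 3 to 7 can be sketched in a few lines of code. The sketch below is a minimal illustration under stated assumptions, not the author's implementation: the function name generec_step, the toy layer sizes, the learning rate of 0.25, the weight initialisation, and the single training pattern are all hypothetical, and lateral inhibition at the output is omitted.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def generec_step(x, y, prev_out, w_ih, w_ho, eta=0.25):
    """One minus/plus-phase update, following equations 3 to 7.

    x        : inputs, with x[0] = 1.0 acting as the bias unit (i = 0)
    y        : desired outputs
    prev_out : actual outputs of the previous step, o_k(t-1)
    w_ih     : w_ih[i][j], input i -> hidden j
    w_ho     : w_ho[j][k], shared by the bottom-up and top-down paths
    Weights are updated in place. Returns (o, h_minus, h_plus).
    """
    n_hidden, n_out = len(w_ho), len(y)
    bottom_up = [sum(w_ih[i][j] * x[i] for i in range(len(x)))
                 for j in range(n_hidden)]
    # Minus phase: equations 3 and 4.
    h_minus = [sigmoid(bottom_up[j]
                       + sum(w_ho[j][k] * prev_out[k] for k in range(n_out)))
               for j in range(n_hidden)]
    out = [sigmoid(sum(w_ho[j][k] * h_minus[j] for j in range(n_hidden)))
           for k in range(n_out)]
    # Plus phase: equation 5 (desired output propagated top-down).
    h_plus = [sigmoid(bottom_up[j]
                      + sum(w_ho[j][k] * y[k] for k in range(n_out)))
              for j in range(n_hidden)]
    # Weight updates: equations 6 and 7, using only locally available values.
    for j in range(n_hidden):
        for k in range(n_out):
            w_ho[j][k] += eta * (y[k] - out[k]) * h_minus[j]
    for i in range(len(x)):
        for j in range(n_hidden):
            w_ih[i][j] += eta * (h_plus[j] - h_minus[j]) * x[i]
    return out, h_minus, h_plus


# Toy run on a single pattern (sizes are illustrative, not the paper's).
rng = random.Random(0)
n_in, n_hidden, n_out = 3, 3, 2
w_ih = [[rng.uniform(-0.1, 0.1) for _ in range(n_hidden)] for _ in range(n_in)]
w_ho = [[rng.uniform(-0.1, 0.1) for _ in range(n_out)] for _ in range(n_hidden)]
x = [1.0, 1.0, 0.0]          # x[0] is the bias unit
y = [1.0, 0.0]
prev_out = [0.0] * n_out
errors = []
for _ in range(200):
    out, _, _ = generec_step(x, y, prev_out, w_ih, w_ho)
    errors.append(sum((t - o) ** 2 for t, o in zip(y, out)))
    prev_out = out
initial_error, final_error = errors[0], errors[-1]
```

Each update uses only the activations at the two ends of the connection being changed, which is the locality property that motivates the procedure.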



5 Comparing TRP-GR with TRP-BP

Nowadays, neural network models are considered biologically impoverished,
although computationally efficient. It has been shown that neurophysiologically
based systems can be as computationally efficient as current connectionist systems, or
even more so [15]. This paper shows that a connectionist system with an
architecture and learning procedure based on neuroscience features, and therefore
biologically plausible, can also be computationally efficient, indeed more efficient than
conventional systems, at least for a particular natural language processing
application.



5.1 Training 

A sentence generator supplies 364 different training sentences, presented one word at 
a time, according to semantic and syntactic constraints, employing a lexicon 
consisting of 30 nouns and 13 verbs, including thematically ambiguous verbs. It is 
important to emphasize here that the same training sentences are generated for both 
modules. After about 100,000 training cycles, which corresponds to an average output
error' of 10'\ the system is able to display, with a high degree of certainty, the 
thematic grid for an input sentence. 



5.2 Set of Test Sentences 

In order to compare TRP-GR with TRP-BP, 16 test sentences, different from the training
sentences, were generated by the sentence generator for both modules. Four of them
are shown in figures 3 to 6, with their outputs representing thematic roles. These
sentences reveal the better computational performance of the biologically plausible
module TRP-GR, at least regarding the sentences belonging to the test set (GR is
11.61% more efficient than BP). As an alternative to the automatically generated
sentences, the user can enter the sentence to be tested by hand.

Figure 3 shows the outputs of the system for the sentence the boy fears the
man. The current word entered is in bold, while the previous words already entered
are listed sequentially in parentheses. The first output line is for BP and the second for
GR. The closer the output is to 1.0, the more precise the thematic role prediction.
Notice that in the GR module the outputs are more accurate, at least regarding the
expected thematic role for each word. Recall that the first word entered (the subject
boy) has no thematic role displayed until the verb shows up, which is why it does
not appear in the figure. The subject (boy) is assigned the thematic role
EXPERIENCER, because fear asks for an experiencer subject: it is a psychological verb.



Sentence: the boy fears the man

Word presented to the system: (boy) fear
     agent  patie  exper  theme  sourc  goal   benef  cause  instr  value
BP   0.001  0.004  0.966  0.000  0.007  0.015  0.020  0.016  0.000  0.000
GR   0.000  0.000  0.998  0.031  0.000  0.000  0.000  0.000  0.000  0.000

Word presented to the system: (boy-fear) man
     agent  patie  exper  theme  sourc  goal   benef  cause  instr  value
BP   0.000  0.024  0.054  0.101  0.044  0.049  0.034  0.001  0.003  0.001
GR   0.000  0.021  0.058  0.168  0.136  0.047  0.015  0.000  0.000  0.000



Fig. 3. Outputs of the system for the sentence the boy fears the man. Both modules arrived at 
the expected thematic grid [EXPERIENCER, THEME], but module GR values are closer to 1.0 



Figure 4 displays outputs for the sentence the girl bought a ball by a hundred 
dollars. Notice again that in the GR module, the outputs are more precise. 

Figure 5 shows the outputs of the system for the sentence the man hit the doll, with
the thematically ambiguous verb hit entered in its default reading, since it is
unknown to the system which hit is intended. Notice that, in this case, the thematic
role assigned to the subject must be AGENT instead of CAUSE, because the noun man
has features associated with being capable of controlling the action, for instance
human and animate (see table 2). As one can see, only the GR module gave a suitable
prediction for the thematic role of the subject.



1 The average output error is the difference between “actual” and “desired” outputs, and it is
obtained from the average squared error energy formula [14] for each set of different 
sentences presented to the network. 
Sentence: the girl bought a ball by a hundred dollars

Word presented to the system: (girl) buy
     agent  patie  exper  theme  sourc  goal   benef  cause  instr  value
BP   0.984  0.000  0.015  0.049  0.004  0.005  0.003  0.004  0.016  0.000
GR   1.000  0.000  0.000  0.002  0.000  0.000  0.000  0.000  0.000  0.000

Word presented to the system: (girl-buy) ball
     agent  patie  exper  theme  sourc  goal   benef  cause  instr  value
BP   0.011  0.036  0.010  0.412  0.014  0.010  0.019  0.004  0.077  0.016
GR   0.000  0.038  0.000  0.877  0.000  0.000  0.000  0.000  0.055  0.001

Word presented to the system: (girl-buy-ball) hundred
     agent  patie  exper  theme  sourc  goal   benef  cause  instr  value
BP   0.001  0.029  0.002  0.146  0.006  0.005  0.006  0.002  0.080  0.928
GR   0.000  0.000  0.000  0.046  0.000  0.000  0.000  0.000  0.002  0.998

Fig. 4. Outputs of the system for the sentence the girl bought a ball by a hundred dollars. Both 
modules arrived at the expected thematic grid [AGENT, THEME, VALUE], although in module GR 
the displayed values are closer to 1.0 



Sentence: the man hit the doll

Word presented to the system: (man) hit
     agent  patie  exper  theme  sourc  goal   benef  cause  instr  value
BP   0.531  0.000  0.001  0.012  0.004  0.004  0.003  0.768  0.003  0.000
GR   0.542  0.000  0.000  0.004  0.000  0.000  0.000  0.482  0.000  0.000

Word presented to the system: (man-hit) doll
     agent  patie  exper  theme  sourc  goal   benef  cause  instr  value
BP   0.020  0.082  0.001  1.000  0.001  0.000  0.004  0.001  0.000  0.000
GR   0.000  0.079  0.000  0.881  0.000  0.000  0.000  0.000  0.000  0.000



Fig. 5. Outputs of the system for the sentence the man hit the doll. Module GR arrived at the
expected thematic grid [AGENT, THEME], while module BP arrived at [CAUSE, THEME]
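The thematic grids quoted in the caption of figure 5 can be read off the output vectors by taking the most active unit for each word. The sketch below uses the subject values reported in figure 5; treating the winning unit as the assigned role is an assumption made for illustration, as is the helper name predicted_role.

```python
# Reading a thematic role from the ten output units.
# Assumption: the role of the most active output unit is the one assigned.

ROLES = ["AGENT", "PATIENT", "EXPERIENCER", "THEME", "SOURCE",
         "GOAL", "BENEFICIARY", "CAUSE", "INSTRUMENT", "VALUE"]

# Outputs for the subject of "the man hit the doll" (figure 5, word "hit").
FIG5_SUBJECT = {
    "BP": [0.531, 0.000, 0.001, 0.012, 0.004, 0.004, 0.003, 0.768, 0.003, 0.000],
    "GR": [0.542, 0.000, 0.000, 0.004, 0.000, 0.000, 0.000, 0.482, 0.000, 0.000],
}


def predicted_role(outputs):
    """Return (role, activation) for the most active output unit."""
    best = max(range(len(outputs)), key=outputs.__getitem__)
    return ROLES[best], outputs[best]


for module, outputs in FIG5_SUBJECT.items():
    role, activation = predicted_role(outputs)
    print(f"{module}: {role} ({activation:.3f})")
```

With these values the BP module selects CAUSE and the GR module selects AGENT, although in both modules the two competing units are close, which reflects the ambiguity of the default reading of hit.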



Sentence: the hammer broke the vase

Word presented to the system: (hammer) break
     agent  patie  exper  theme  sourc  goal   benef  cause  instr  value
BP   0.486  0.000  0.001  0.006  0.004  0.005  0.003  0.764  0.004  0.000
GR   0.093  0.000  0.000  0.005  0.000  0.000  0.000  0.912  0.000  0.000

Word presented to the system: (hammer-break) vase
     agent  patie  exper  theme  sourc  goal   benef  cause  instr  value
BP   0.001  0.325  0.011  0.313  0.004  0.004  0.017  0.001  0.077  0.003
GR   0.000  0.333  0.001  0.536  0.000  0.000  0.000  0.000  0.038  0.000



Fig. 6. Outputs for the sentence the hammer broke the vase. Both modules arrived at the
expected thematic role CAUSE for the subject. Concerning the object, while in BP there is a
slight preference for PATIENT, in GR there is an unexpected assignment of THEME














Finally, figure 6 shows outputs for the sentence the hammer broke the vase, which
employs another thematically ambiguous verb (break). The subject of this sentence
(hammer) is assigned the thematic role CAUSE instead of AGENT, because the hammer
causes the breaking but is not responsible for the action. Its semantic features
include non-human and tool/utensil, which are incompatible with control of action, the
feature expected to be associated with a verb that assigns the thematic role AGENT. For
the object, module BP worked better, assigning PATIENT to vase instead of THEME.



6 Conclusion 

The modules TRP-BP and TRP-GR of the proposed system are connectionist
approaches to natural language processing, dealing with the thematic role relationships
between the words of a sentence. The aim of this paper is to show that a biologically
motivated connectionist system, with a bi-directional architecture and a learning
algorithm that uses only local information to update its synaptic weights, is able not
only to handle this natural language processing problem, but also to be more
computationally efficient than the conventional backpropagation learning procedure
applied to a simple recurrent connectionist architecture. This is confirmed by the
outcomes for the same set of test sentences presented to both modules of the system.
Abstract. This paper presents a novel approach to question answering: the use
of argumentation techniques. Our question answering system deals with
argumentation in student essays: it sees an essay as an answer to a question and
gauges its quality on the basis of the argumentation found in it. Thus, the
system looks for expected types of argumentation in essays (i.e. the expectation
is that the kind of argumentation in an essay is correlated with the type of
question). Another key feature of our work is our proposed categorisation for
argumentation in student essays, as opposed to categorisation of argumentation
in research papers, where, unlike the case of student essays, it is relatively
well known which kind of argumentation can be found in specific sections.



1 Introduction 

A new line of research in Question Answering that was discussed at the Symposium
on New directions on Question Answering (Stanford University, spring 2003) is the
use of knowledge in question answering. This knowledge, which might be encoded
in ontologies, would, in our view, enhance the question answering process. Such a
research direction has already been taken by the AQUA project [1][2] at the Open
University, England. AQUA makes extensive use of knowledge (captured in an
ontology) in several parts of the question answering process, such as query
reformulation and the similarity algorithm (assessing similarity between names of
relations in the query and in the knowledge base). Currently, AQUA is coupled with the AKT
reference ontology*, but in the future it will handle several different ontologies.

This paper proposes a somewhat different approach to question answering: here we 
use argumentation for finding answers in the specific domain of student essays. This 
means that specific categories of argumentation depend on the type of question. 
Current work is on how argumentation could be complemented with a reasoning 
system which will be able to decide on action plans in case an answer is not found. 
We also make use of knowledge in advising students about missing categories in the 
essays just like in Expert Systems. It should be noted that the aim of this work is not 
to produce a system that understands student essays as humans do. Our goal is a 



* The AKT reference ontology contains classes and instances of people, organizations, 
research areas, publications, technologies and events. 
(http://akt.open.ac.uk/ocml/domains/akt-support-ontology/) 

R. Monroy et al. (Eds.): MICAI 2004, LNAI 2972, pp. 400-409, 2004.

© Springer-Verlag Berlin Heidelberg 2004 
system that is able to locate the chunk(s) of text in which an answer to a question can
be found. In our system, the user plays a key role: it is the user that performs answer 
validation and provides feedback to the system. 

Our test bed is a set of postgraduate student essays - a type of free or non- 
structured text - and corresponding essay questions. Therefore, in our domain, 
argumentation cannot be found in specific sections of the text (as in research papers). 
The first contribution of this paper is the use of argumentation techniques in the 
question answering problem, as opposed to conventional approaches to question 
answering such as information retrieval. Our second contribution is our argumentation 
categorisation for student essays: this is loosely based on research in argumentation in 
academic papers, but omits categories that are not applicable to this domain. More 
details on our categorisation can be found in section 3.1 and [3].

The paper is organised as follows: section 2 presents the question answering 
process model. Section 3 discusses the research background on argumentation 
schemas in papers and argument modelling and then introduces our essay 
metadiscourse categorisation in the context of the reviewed background. Section 4 
reports on our annotation categories and essay questions. Section 5 describes our 
testbed and actual matching of argumentation with essay questions. Section 6 reports 
preliminary results and indicates future work. Finally, section 7 draws our conclusion. 



2 Question Answering Process Model 

The proposed architecture (Figure 1) of our system comprises: interface, query
classification, segmentation, categorisation, reasoner and annotation modules.

• The interface is a window menu interface. 

• The query classification module classifies queries as belonging to one of the types 
defined in our system. 

• The segmentation module obtains segments of student essays by using a library of 
cue phrases and patterns. 

• The categorisation component classifies the segments as one of our categories. 

• The reasoner is an expert system that will reason about the categories found in a
student essay.

• The annotation module annotates relevant phrases as belonging to one of our 
defined categories. These annotations are saved as semantic tags. Future 
implementation may use machine learning for learning cue phrases. 
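The segmentation, categorisation, and annotation steps listed above can be illustrated with a small cue-phrase matcher. This is a sketch under stated assumptions rather than the SES implementation: the cue phrases are a subset of the examples given in Table 1, segments are taken to be sentences, and the function names segment, categorise, and annotate are hypothetical.

```python
import re
from collections import Counter

# A small cue-phrase library; the phrases are examples taken from Table 1.
CUE_PHRASES = {
    "DEFINITION": ["is about", "concerns", "refers to", "is similar to"],
    "REPORTING": ["discusses", "suggests", "warns"],
    "POSITIONING": ["i accept", "i am unhappy with", "personally"],
    "STRATEGY": ["i will attempt to", "in section"],
    "PROBLEM": ["there are difficulties", "is problematic", "limitations"],
}


def segment(essay):
    """Split an essay into sentence-like segments (an assumed granularity)."""
    parts = re.split(r"(?<=[.!?])\s+", essay.strip())
    return [part for part in parts if part]


def categorise(segment_text):
    """Return every category whose cue phrases occur in the segment."""
    lowered = segment_text.lower()
    return [
        category
        for category, cues in CUE_PHRASES.items()
        if any(re.search(r"\b" + re.escape(cue) + r"\b", lowered)
               for cue in cues)
    ]


def annotate(essay):
    """Pair each segment with its categories, like semantic tags."""
    return [(seg, categorise(seg)) for seg in segment(essay)]


essay = ("Smith discusses knowledge management. "
         "Personally, I am unhappy with this view. "
         "The approach is problematic. "
         "The weather was fine.")
annotations = annotate(essay)
counts = Counter(cat for _, cats in annotations for cat in cats)
```

Here the first three sentences are tagged REPORTING, POSITIONING, and PROBLEM respectively, and the last receives no tag; the per-category counts are the kind of figure a viewer could display next to the highlighted text.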

Our question answering system, the Student Essay System (SES), has a visual 
component - the Argumentation Viewer (AV) - which highlights instances of our 
argumentation categories in an essay, so as to give a visual representation of 
argumentation within an essay, in a shallow version of “making thinking 
visible”[4][5]. The intuition is that essays with considerably more “highlighting” 
contain more argumentation (and actual “content”) and therefore attract higher grades. 

The viewer can be used by tutors during assessment: they may refer to its 
automatic counts indicator, citation highlighting or simply use it to quickly gauge the 
amount and distribution of argumentation cues across an essay. AV can also provide 
formative feedback to students. Thus, if students running it on their own essay see 
that little argumentation is found, they are well advised to “revise” their essay before
submission. An improvement in the essay (more background and reasoned 
argumentation) should result in more highlighting, which may increase motivation in 
some students. 







Fig. 1. Student Essay System Model 



3 Argument Modelling in Papers 

Relevant research background spans from articles on argumentation in research 
papers to knowledge representation tools supporting the construction of rhetorical 
arguments. An important strand of research has focused on paper structure, producing 
metadiscourse taxonomies applicable to research papers. In his CARS model, Swales 
[6] synthesised his findings that papers present three moves: authors first establish a 
territory (by claiming centrality, making topic generalisations and reviewing items of 
previous research), then they establish a niche (by counter-claiming, indicating a gap 
or question-raising) and finally they occupy this niche (by outlining purpose, 
announcing present research and principal findings and indicating paper structure). 
Although his analysis targeted only the introductory part of an academic research 
paper, his model has nevertheless been influential. For instance, Teufel [7] extended 
Swales’s CARS model by adding new moves to cover the other sections. They 
classify sentences into background, other, own, aim, textual, contrast and basic 
categories. The authors claim that this methodology could be used in automatic text
summarisation, since the latter requires finding important sentences in a source text 
by determining their most likely argumentative role. Their experiments showed that 
the annotation schema can be successfully applied by human annotators, with little 
training. 

Hyland [8] distinguishes between textual and interpersonal metadiscourse in 
academic texts. The former refers to devices allowing the recovery of the writer’s 
intention by explicitly establishing preferred interpretations; they also help form a 
coherent text by relating propositions to each other and to other texts. Textual 
metadiscourse includes logical connectives (in addition, but, therefore etc), frame
markers (e.g. finally, to repeat, our aim here), endophoric markers (noted above, see
Fig 2, table 1, below), evidentials (According to X, Y states) and code glosses
(namely, e.g., in other words, such as). Interpersonal metadiscourse, instead,
expresses the writer’s persona by alerting the reader to the author’s perspective on
both the information and the readers themselves. Categories of interpersonal
metadiscourse are hedges (might, perhaps, it is possible), emphatics (in fact, 
definitely, it is clear, obvious), attitude markers (Surprisingly, I agree), relational 
markers (Frankly, note that, you can see) and person markers (I, we, me, mine, our). 

Another interesting source is ScholOnto, an Open University project aiming to 
model arguments in academic papers and devise an ontology for scholarly 
discourse [9]. As part of their project, they developed ClaiMaker, a tool for browsing
and editing claims. Claims are classified as general (e.g. is about, uses, applies, 
improves on), problem-related (e.g. addresses, solves), evidence (supports/ 
challenges), taxonomic, similarity and causal. ClaiMaker is meant for academic 
research papers, whereas we want an argumentation categorisation for student essays. 



3.1 Our Approach to Argumentation on Student Essays 

As a first step in our research, we identified candidate categories of argumentation 
in student essays through a preliminary manual analysis of essay texts. Some 
categories were influenced by ClaiMaker and the other categorisations seen above. 

Our bottom-up approach initially yielded the following argumentation categories: 
definition, comparison, general, critical thinking, reporting, viewpoint, problem, 
evidence, causal, taxonomic, content/expected and connectors. Some categories have 
sub-categories (e.g. connectors comprises topic introduction, inference, contrast, 
additive, support, reformulation and summative subcategories of connectors). 

A review of this schema prompted us to reduce the number of categories (to avoid
cognitive overload and allow clearer visualisation). We thus grouped related categories and turned them
into subcategories of a new category (e.g. evidence, causal and taxonomic became 
subcategories of the new “link” category) or modified categories (“viewpoint” 
merged into “positioning”, the new name for “critical thinking”). Our revised 
categorisation also sees comparison as part of definition, because we often define a 
concept by comparing it with others. The outcome of the rationalisation process is the 
following student essay categorisation: definition, reporting, positioning, strategy, 
problem, link, content/expected, connectors and general (Table 1). 

Compared to Teufel’s schema, ours lacks an AIM category: this is because all 
student essays have the implicit aim of answering the essay question. Similarly, we do 
not distinguish between OTHER and OWN (knowledge shared by author in other 
papers and this paper respectively), as this does not apply to student essays. On the 
other hand, our content/expected category has no counterpart in the other 
categorisations, since it is a student essay-specific category comprising cue phrases 
identifying content that the tutor expects to find in the essay. Overall, however, there 
are remarkable similarities across these categorisations (for a comparison, see [3]). 



Table 1. Our Taxonomy for Argumentation in Student Essays

| Category | Description | Cue phrases (examples) |
|---|---|---|
| DEFINITION | Items relating to the definition of a term. Often towards the beginning. IS_ABOUT, COMPARISONS | is about, concerns, refers to, definition; is the same; is similar/analogous to |
| REPORTING | Sentences describing other research in a neutral way | “X discusses”, “Y suggests”, “Z warns” |
| POSITIONING | Sentences critiquing other research; VIEWPOINTS | “I accept”, “I am unhappy with”, “personally” |
| STRATEGY | Explicit statements about the method or the textual section structure of the essay | “I will attempt to”, “in section 2” |
| PROBLEM | Sentences indicating gap or inconsistency, question-raising, counter-claiming | “There are difficulties”, “is problematic”, “limitations” |
| LINK | Statements indicating how categories of concepts relate to others: TAXONOMIC, EVIDENCE, CAUSAL | “subclass of”, “example of”, “would seem to confirm”, “has caused” |
| CONTENT/EXPECTED | Any concept that the tutor expects students to mention in their essay. Tutor-editable | Essay-dependent |
| CONNECTORS | Links between propositions may serve different purposes (topic introduction, support, inference, additive, parallel, summative, contrast, reformulation) | “With regard to”, “As to”, “Therefore”, “In fact”, “In addition”, “Overall”, “However”, “In short” |
| GENERAL | Generic association links | “is related to” |
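
The cue phrases of Table 1 lend themselves to a simple lookup. The following is a minimal sketch of cue-based annotation, not the implementation used in AV: the lexicon is a small illustrative subset of Table 1, and the function names `annotate` and `count_by_category` are assumptions introduced here.

```python
import re
from collections import Counter

# Illustrative subset of the cue phrases in Table 1 (assumption: the real
# lexicon is larger and tutor-editable for the content/expected category).
CUE_PHRASES = {
    "definition": ["is about", "concerns", "refers to", "is similar to"],
    "reporting": ["discusses", "suggests", "warns"],
    "positioning": ["i accept", "i am unhappy with", "personally"],
    "strategy": ["i will attempt to", "in section"],
    "problem": ["there are difficulties", "is problematic", "limitations"],
    "link": ["subclass of", "example of", "would seem to confirm", "has caused"],
    "connectors": ["with regard to", "therefore", "in addition", "overall",
                   "however", "in short"],
    "general": ["is related to"],
}


def annotate(text):
    """Return (category, cue, start_offset) for every cue phrase found in text."""
    lowered = text.lower()
    hits = []
    for category, cues in CUE_PHRASES.items():
        for cue in cues:
            # Whole-word, case-insensitive match of the cue phrase.
            pattern = r"\b" + re.escape(cue) + r"\b"
            for match in re.finditer(pattern, lowered):
                hits.append((category, cue, match.start()))
    return sorted(hits, key=lambda hit: hit[2])


def count_by_category(hits):
    """Count annotations per category."""
    return Counter(category for category, _, _ in hits)


if __name__ == "__main__":
    sample = ("But there are difficulties with this approach. "
              "Personally, I accept that Peters suggests otherwise.")
    for category, cue, offset in annotate(sample):
        print(offset, category, cue)
```

Matching at the phrase level keeps the sketch in line with the cue-based approach described in the text; it does not attempt to classify whole sentences.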



4 Annotation Categories and Essay Questions 

Query classification gives information about the kind of answer our system should 
expect. The classification phase involves processing the query to identify the category 
of answer being sought. In particular, sentence segmentation is carried out: this 
reveals nouns, verbs, prepositions and adjectives. The categories of possible answers, 
which are listed below, extend the universal categorisation used in traditional question 
answering systems (by adding to the six categories: what, who, when, which, why and 
where). Our analysis of the essay questions in our testbed (see Table 2 for questions 
and Section 5 for testbed) showed that they were answered by essays with different 
“link profiles” (see Table 3). 

Table 2. Examples of Essay Questions

| Question type | Assignment | Example |
|---|---|---|
| 1. Summary + How and Why | Ass 1, part 2 | “In the light of Otto Peter’s ideas... say how each type can or cannot serve these ideas and why” |
| 2. Opinion about X | Ass 2, part 1 | “Who do you think should define the learners’ needs in distance education?” |
| 2. Opinion about X | Ass 4, part 3 | “State and define your views on the questions of whether the research is adequately addressing what you regard to be the important questions or debates” |
| 3. Describe + Discuss | Ass 2, part 2 | “Imagine you are student and your teacher has a strong leaning towards the technical-vocational orientation. Describe and discuss your experiences, using concepts and examples from text book 1.” |
| 3. Describe + Discuss | Ass 4, part 2 | “Define and discuss any cultural factors you observe in relation to each of these questions” |
| 4. Give example of X and Critique X | Ass 4, part 1 | “Provide examples of web links covering a wide range of choose aspects of open and distance education and write a short critique of each.” |



The basic idea is that, depending on the essay question, we expect to find a 
different “distribution” of links in the essays themselves. For instance, a question 
asking for a “summary” is usually answered by an essay containing many “reporting” 
links. Table 3 matches essay questions with our essay metadiscourse categories 
(Table 2). We ran a statistical analysis of links and question types and our findings are 
presented in section 5. 



Table 3. Examples of Essay Questions and Expected Links

| Example of Question | Links Expected to be Important in Essay |
|---|---|
| 1. Summary of X + How and Why | Essays answering such questions have a high number of reporting, positioning, expected and contrast links. |
| 2. Opinion about X | Essay has a high number of background, expected names, positioning links. |
| 3. Describe and Discuss | These essays feature a high number of support and positioning links. In assignment 2, part 2, there was a low number of reporting links, as students were asked to describe a hypothetical situation; however, this may not always be the case. |
| 4. Give an example of X and Critique X | Here, analysis and summative connector links are higher than “is about” and “contrast” links. |
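
The idea of a “link profile” can be illustrated with a short sketch. It assumes that an essay has already been annotated with one category label per link; the profile names, the `EXPECTED_LINKS` mapping (loosely following Table 3) and the `min_share` threshold are hypothetical and are not taken from the system described here.

```python
from collections import Counter

# Hypothetical expected profiles, loosely following Table 3.
EXPECTED_LINKS = {
    "summary": ["reporting", "positioning", "expected", "contrast"],
    "opinion": ["background", "expected", "positioning"],
    "describe_discuss": ["support", "positioning"],
    "example_critique": ["analysis", "summative"],
}


def link_profile(annotations):
    """Relative frequency of each link category in an essay's annotations."""
    counts = Counter(annotations)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {category: n / total for category, n in counts.items()}


def lacking_categories(annotations, question_type, min_share=0.10):
    """Expected categories whose share of all links falls below min_share.

    min_share is an arbitrary illustrative threshold, not a value from the paper.
    """
    profile = link_profile(annotations)
    return [
        category
        for category in EXPECTED_LINKS[question_type]
        if profile.get(category, 0.0) < min_share
    ]


if __name__ == "__main__":
    essay_links = ["reporting"] * 6 + ["positioning"] * 3 + ["contrast"]
    print(link_profile(essay_links))
    print(lacking_categories(essay_links, "summary"))
```

A list of lacking categories of this kind is the sort of information a question analysis module could pass back to the student as formative feedback.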



5 Test Bed 

Our testbed consists of 193 anonymised essays (belonging to 4 different assignments), 
with corresponding essay questions. The essays were anonymised versions of actual 
essays submitted by students as part of a Masters Course at the Open University. The 
essays were marked by three experienced tutors, whose comments we also consulted. 
We chose student essays as our domain, as essays are typically less structured than 
other types of documents, including academic research papers, and therefore more 
difficult to work with. If our approach works with student essays, it will work even 
better with more structured types of texts. Also, while good results have been 
achieved in the area of student essay classification and assessment with statistical 
methods, these methods have no semantics and do not provide useful feedback, which 
is of course of particular importance in the area of student essays. 



[Screenshot of the Student Essay Viewer in a web browser: a schema selector (Ours, Hyland, Swales), an essay extract with highlighted cue phrases, and a summary table with the columns LINK, CATEGORY and OCCURRENCES, listing problem (1 occurrence) and positioning (2 occurrences).]


Fig. 2. Argumentation Viewer (AV) showing all annotations in an essay 



Having devised an argumentation schema for essays, we decided to implement a 
system to visualise argumentation in student essays. The resulting AV, while being 
as easy to use as a webpage, can be a time-saving tool for both tutors and students, 
thanks to its quick visualisation of argumentation in an essay. However, students 
may particularly benefit from a question analysis module: this could analyse and 
classify essay questions with respect to the type of argumentation required in the 
essay, thus alerting students to missing (or lacking) categories of 
argumentation. This functionality would be very useful in a formative context and 
would encourage students to stop and think about whether what they are writing 
is answering the question, rather than simply “waffling on”. AV is thus part of SES, a 
question-answering tool that tries to help create a satisfactory answer to a question 
and can alert the user if such satisfaction is not achieved. At present, we 
are working at the phrase level, but hope to move on to longer linguistic units 
(e.g. sentences and/or paragraphs) soon. 

We therefore determined what “link profiles” (Tables 2, 3) could reasonably be 
expected in a satisfactory essay written for assignments 1 and 2 and then performed a 
statistical analysis on the data in our possession to verify these hypotheses and find 
out the specific kind of argumentation that SES should be looking for relative to each 
type of question. The results of this analysis are summarised in Table 4. 

Table 4. Expected and Actual Argumentation Links in Assignments 1 and 2

ID: Ass 1, Part 1
Expected: many reporting links
Results:
- reporting links count significant (r=0.730; N=12; p<0.01)
- positioning links count is not
- total link count significant: r=0.624; N=12; p<0.05; F(1,10)=6.385; p<0.05
Analysis: Both Spearman correlation and ANOVA F-statistic seem to support our expectations: reporting links are more important than positioning links in this type of essay.

ID: Ass 1, Part 2a
Expected: high number of reporting, positioning and expected links.
Results:
- reporting more important than positioning
- statistical significance for “specific reporting links”:
  a) “Peters”: r=0.744; n=12; p<0.01
  b) “Peters+industrial+ODE”: r=0.717; n=12; p<0.01; F(1,14)=6.524; p<0.05
Analysis: Some students, while including sufficient reporting/expected links, managed to wander off topic (and hence their grade was not high). Better grades achieved by essays that stayed “on topic” (“specific reporting” links).

ID: Ass 1, Part 2b
Results:
- significant correlation between score and specific reporting links: r=0.526; n=15; p<0.05 (r=0.586 if we ignore references to “Holmberg”)
- no statistical significance for generic reporting or positioning links
- expected not significant
Analysis: Many students wandered off topic (discussed around Holmberg / expected stuff but not enough on guided didactic conversation or GDC). Hence, only reliable indicator is specific reporting links.

ID: Ass 2, Part 1
Expected: positioning links important
Results:
- positioning links show a significant correlation with score: r=0.538; n=20; p<0.05
Analysis: When background is not “at the forefront” in an essay question, positioning tends to be the determinant link type.

ID: Ass 2, Part 2
Expected: reporting (especially reporting on Schön)
Results:
- reporting links (generic): Spearman’s Rho: 0.467; n=20; p<0.05
- specific reporting links: r=0.541; n=20; p<0.05
- word count: r=0.639; n=20; p<0.01
Analysis: Reporting links are important in this kind of essay, particularly links directly connected to the question (students sometimes tended to wander off topic). Word count is important, again, as this is the last part in Ass2 and some students overran their target in part 1.

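
Table 4 reports rank correlations between link counts and essay scores. As a reminder of how such a coefficient is obtained, the sketch below computes Spearman's rho as the Pearson correlation of average ranks. It is a generic textbook computation, not the statistics package used for the analysis, and the sample data in the usage lines are invented for illustration only.

```python
import math


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def spearman_rho(xs, ys):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


if __name__ == "__main__":
    # Invented example data: reporting-link counts and marks for six essays.
    reporting_links = [3, 7, 5, 10, 8, 4]
    scores = [52, 65, 66, 74, 70, 50]
    print(round(spearman_rho(reporting_links, scores), 3))
```

Significance testing (the p-values in Table 4) is not covered by this sketch.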
While positioning links are determinant in Assignment 2 part 1, overall, the 
importance of reporting links is apparent: after all, essays at graduate and post- 
graduate level nearly always - to some extent - require showing that one has “done 
the reading”. Where reporting links were not significantly correlated with grade, this 
seems to be because students wandered off topic (e.g. they talked about Holmberg and 
his ideas at length, but did not spend most of their time and words on guided didactic 
conversation, which is what the question specifically asked about). This suggests that 
- in order to detect if an essay is answering the question (as opposed to going off 
topic) - our tool should make use of both a “generic” reporting link category and a 
more specific one (“specific reporting links” in Table 4), with instances derived from 
query classification techniques (such as sentence segmentation) applied to the essay 
query. Examples of cues used for “specific reporting” in Assignment 1 part 2a were: 
Peters, industrial and Open & Distance Learning. 
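
The distinction between generic and specific reporting links can be sketched as follows. The sketch assumes, purely for illustration, that a sentence counts as a generic reporting link when it contains a reporting cue, and as a specific one when it also mentions a cue derived from the essay question. The sentence splitter, the cue list and the function names are assumptions, not the tool's actual logic.

```python
import re

# Illustrative reporting cues taken from the examples in Table 1.
REPORTING_CUES = ["discusses", "suggests", "warns"]


def split_sentences(text):
    """Naive sentence splitter on '.', '!' and '?' (an assumption of this sketch)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def reporting_links(essay, specific_cues):
    """Return (generic, specific) counts of reporting links in an essay."""
    generic = 0
    specific = 0
    for sentence in split_sentences(essay):
        lowered = sentence.lower()
        has_reporting_cue = any(
            re.search(r"\b" + re.escape(cue) + r"\b", lowered)
            for cue in REPORTING_CUES
        )
        if has_reporting_cue:
            generic += 1
            # A specific link also mentions a question-derived cue.
            if any(cue.lower() in lowered for cue in specific_cues):
                specific += 1
    return generic, specific


if __name__ == "__main__":
    essay = ("Peters suggests that distance education is industrial. "
             "Holmberg discusses conversation at length. "
             "Holmberg warns against isolation.")
    print(reporting_links(essay, ["Peters", "industrial"]))
```

An essay with many generic but few specific reporting links would, under this reading, be a candidate for having wandered off topic.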



6 Results and Future Work 

Our main contribution is the application of argumentation techniques to question 
answering. A second contribution of this paper is our student essay metadiscourse 
schema, which we have compared and contrasted with categorisations in the research 
paper domain (Section 1). We have analysed links between argumentation in essays 
and score to determine if an essay is answering the question (Section 4) and gauge its 
overall quality. We found that the total number of links correlates with score, that 
positioning and background (expected + reporting) are the variables that generally 
contribute the most to score prediction and that the essay question is associated with 
the relative importance of different link types in an essay. We also found that 
“specific reporting links” are often needed to detect off-topic wanderings in student 
essays. 

We have implemented part of a question-answering system in the domain of 
student essays. SES is based on the idea that different essay profiles answer different 
types of questions and that therefore it can give useful feedback to students trying to 
answer an essay question. One of its components, AV, visualises the highlighted 
argumentation categories used in an essay: in particular, the type of argumentation 
and its concentration. Students may use SES to get feedback about their essay, 
particularly about lacking categories. 

In our investigation, we have used real data, actual essays written by postgraduate 
students as part of their course. We believe that the results reported here are 
encouraging in terms of the quality and robustness of our current implementation. 
However, there is clearly a lot more work needed to make this technology easy 
enough to use for tutors and students (who are neither experts in language 
technologies nor 'power knowledge engineers'). Future implementations of the 
student essay viewer could categorise longer linguistic units (e.g. sentences or 
paragraphs) and explain the reasons why a specific categorisation is assigned to them. 
These explanations might be displayed in pseudo-natural language. 

Future work includes implementation of an “essay question analysis module”. As 
this paper has shown, depending on the type of essay question asked, different types 
of argumentation are required to answer it and this is exactly where students tend to 
need the most help. The module will help analyse the question, establish what type of 
argumentation is missing/lacking and will determine a set of “specific reporting links” 
for use in detecting off-topic wanderings in essays. A reasoning system could then 
explain why the student is not answering the question and a visualisation component 
such as AV would be provided to display argumentation in student essays. 



7 Conclusion 

This paper has shown how argumentation techniques could be used successfully for 
finding answers to specific categories of questions. Moreover, it has briefly described 
our generic metadiscourse annotation schema for student essays and its links to other 
schemas specifying argumentation in academic papers. An argumentation 
visualisation tool for student essays has been introduced which uses our essay 
annotation schema and a cue-based approach to detect argumentation. 

Finally, the paper has explored some hypotheses as to how essay assessment and 
creation may be aided by the student essay viewer. In particular, thanks to its 
argumentation and question-answering approach, this tool may help students write 
essays that answer the essay question and give them formative feedback during their 
essay-writing efforts. 

Abstract. Numerous methods have been developed for generating a 
machine translation (MT) bilingual dictionary from a parallel text corpus. 
Such methods extract bilingual collocations from sentence pairs of 
source and target language sentences. Then those collocations are registered 
in an MT bilingual dictionary. Bilingual collocations are lexically 
corresponding pairs of parts extracted from sentence pairs. This paper 
describes a new method for automatic extraction of bilingual collocations 
from a parallel text corpus using no linguistic knowledge. We use 
Recursive Chain-link-type Learning (RCL), which is a learning algorithm, 
to extract bilingual collocations. Our method offers two main 
advantages. One benefit is that this RCL system requires no linguistic 
knowledge. The other advantage is that it can extract many bilingual 
collocations, even if the frequency of appearance of the bilingual collocations 
is very low. Experimental results verify that our system extracts 
bilingual collocations efficiently. The extraction rate of bilingual collocations 
was 74.9% for all bilingual collocations that corresponded to nouns 
in the parallel corpus. 



1 Introduction 

Recent years have brought the ability to obtain much information that is written 
in various languages using the Internet in real time. However, current machine 
translation (MT) systems can be used only for a limited number of languages. 
It is important for MT systems to build bilingual dictionaries. Therefore, many 
methods have been studied for automatic generation of an MT bilingual dictionary. 
Such methods are able to produce an MT bilingual dictionary by extracting 
bilingual collocations from a parallel corpus. Bilingual collocations are lexically 
corresponding pairs of parts extracted from sentence pairs of source and target 
language sentences. These studies can be classified into three areas of emphasis. 
Some use a linguistic-based approach [1]. In a linguistic-based approach, the 
system requires static, large-scale linguistic knowledge to extract bilingual 
collocations, e.g., a general bilingual dictionary or syntax information. Therefore, it 
is difficult to apply such an approach to various other languages, because 
developers must acquire the corresponding linguistic knowledge for each of them. 

R. Monroy et al. (Eds.): MICAI 2004, LNAI 2972, pp. 410-419, 2004. 

Other methods for extracting bilingual collocations include statistical approaches 
[2, 3]. In these statistical approaches, it is difficult to extract bilingual 
collocations when the frequency of appearance of the bilingual collocations is 
very low, e.g., only one time. Therefore, the system requires a large bilingual 
corpus, or many corpora, to extract bilingual collocations. Typically, a statis- 
tical approach extracts only bilingual collocations that occur more than three 
times in the parallel corpus. A third type of method emphasizes the use of learn- 
ing algorithms that extract bilingual collocations from sentence pairs of source 
and target language sentences without requiring static linguistic knowledge. We 
have proposed a method using Inductive Learning with Genetic Algorithms 
(GA-IL) [4]. As shown in Fig. 1, this method uses a genetic algorithm to gener- 
ate two sentence pairs automatically with one different part of two source lan- 
guage sentences and with just one different part of two target language sentences. 
Unfortunately, this method requires similar sentence pairs as the condition of 
extraction of bilingual collocations. Therefore, these learning algorithm methods 
require numerous similar sentence pairs to extract many bilingual collocations, 
even though they require no ex ante static linguistic knowledge. 



[Figure: (1) Generation of sentence pairs by applying genetic algorithms. The sentence pairs (He likes tea.; [Kare wa ocha ga suki desu.]) and (She likes tennis.; [Kanojo wa tenisu ga suki desu.]) are combined to generate (She likes tea.; [Kanojo wa ocha ga suki desu.]) and (He likes tennis.; [Kare wa tenisu ga suki desu.]). (2) Extraction of bilingual collocations by Inductive Learning. Comparing the generated sentence pair (He likes tennis.; [Kare wa tenisu ga suki desu.]) with the given sentence pair (He likes tea.; [Kare wa ocha ga suki desu.]) yields (tennis; [tenisu]) and (tea; [ocha]).]


Fig. 1. Example of bilingual collocation extraction using GA-IL 

We propose a new method for automatic extraction of bilingual collocations 
from a parallel corpus to overcome problems of existing approaches. Our method 
uses the learning algorithm we call Recursive Chain-link-type Learning (RCL) 
[5] to extract bilingual collocations efficiently using no linguistic knowledge. In 
this RCL system, various bilingual collocations are extracted efficiently using 
only character strings of previously-extracted bilingual collocations. This feature 
engenders many benefits. This RCL system requires no static analytical 
knowledge, in contrast to a linguistic-based approach. Furthermore, in contrast 
to a statistical approach, it does not require a high frequency of appearance 
for bilingual collocations in the parallel corpus. This means that this RCL system 
can extract bilingual collocations from only a few sentence pairs. Numerous 
similar sentence pairs are unnecessary, in stark contrast to requirements of a 
learning-based approach. Evaluation experiment results demonstrate that this 
RCL system can extract useful bilingual collocations. We achieved a 74.9% 
extraction rate for bilingual collocations which correspond to nouns. Moreover, the 
extraction rate of bilingual collocations for which the frequency of appearance 
was only one in the parallel text corpus was 58.1%. 



2 Overview of Our Method 



[Figure: Process 1, extraction of a word translation using a bilingual template. From the English-Japanese sentence pair (My job is walking the dog.; [Boku no shigoto wa inu o sanpo sa seru koto desu.]) and the bilingual template (My @0 is on the table.; [Watashi no @0 wa teburu no ue ni ari masu.]), the word translation (job; [shigoto]) is extracted. Process 2, acquisition of a bilingual template using a word translation. From the sentence pair (What’s your job?; [Anata no shigoto wa nan desu ka?]) and the word translation (job; [shigoto]), the bilingual template (What’s your @0?; [Anata no @0 wa nan desu ka?]) is acquired.]


Fig. 2. Diagram of the English- Japanese collocation extraction process 



The prominent feature of our method is that it does not require linguistic 
knowledge. We intend to realize a system based only on learning ability from 
the view of language acquisition of children. The RCL algorithm imitates part 
of that principle in language acquisition because it requires no ex ante static 
linguistic knowledge. 

Figure 2 depicts the RCL process, which extracts English-Japanese collocations. 
This RCL system extracts two types of bilingual collocations. The first is 
an English-Japanese collocation such as (job; [shigoto]). It can be registered in 
an MT bilingual dictionary as a word-level bilingual translation. Hereafter, we 
call this type of collocation a word translation. In contrast, phrases such as 
(What’s your @0?; [Anata no @0 wa nan desu ka?]) are representative of 
English-Japanese collocations that are used as templates for extraction of word 
translations. This type of collocation is called a 
bilingual template. In this paper, a word translation is a pair of source and 
target parts; a bilingual template is also a pair of source and target parts. Figure 
2 shows a process by which a word translation (job; [shigoto]) and a new 
bilingual template (What’s your @0?; [Anata no @0 wa nan desu ka?]) are 
extracted reciprocally. 

In process 1 of Fig. 2, this RCL system extracts (job; [shigoto]) as a 
word translation. This (job; [shigoto]) corresponds to the variable “@0” in 
the bilingual template (My @0 is on the table.; [Watashi no @0 wa teburu no ue 
ni ari masu.]). “My” and “is” adjoin the variable “@0” in the source part of the 
bilingual template. They are shared with the English sentence “My job is walking 
the dog.” Moreover, “[no]” and “[wa]” adjoin the variable “@0” in the target part 
of the bilingual template; they are also shared with the Japanese 
sentence “[Boku no shigoto wa inu o sanpo sa seru koto desu.]” Therefore, this 
RCL system extracts (job; [shigoto]) by extracting “job” between the right of 
“My” and the left of “is” in the English sentence, and extracting “[shigoto]” 
between the right of “[no]” and the left of “[wa]” in the Japanese sentence. 

Moreover, this RCL system acquires new bilingual templates using only character 
strings of the extracted (job; [shigoto]). In process 2 of Fig. 2, the 
source part “job” of the word translation (job; [shigoto]) has the same character 
strings as the part in the English sentence “What’s your job?” In addition, 
the target part “[shigoto]” of the word translation (job; [shigoto]) has 
the same character strings as the part in the Japanese sentence “[Anata no 
shigoto wa nan desu ka?]”. Therefore, this RCL system acquires (What’s your @0?; 
[Anata no @0 wa nan desu ka?]) as the bilingual template by replacing “job” and 
“[shigoto]” with the variable “@0” in the sentence pair (What’s your job?; 
[Anata no shigoto wa nan desu ka?]). 

Extracted word translations and bilingual templates are applied to other 
sentence pairs of English and Japanese to extract new ones. Therefore, word 
translations and bilingual templates are extracted reciprocally as a linked chain. 
A characteristic of our method is that both word translations and bilingual 
templates are extracted efficiently using only character strings of sentence pairs 
of the source and target language sentences. Thereby, our system can extract 
bilingual collocations using no linguistic knowledge, even in cases where such 
collocations appear only a few times in the corpus. Figure 2 shows that (job; 
[shigoto]) was extractable even though it appears only one time. 

1 Italics express pronunciation in Japanese. 

2 “/” in Japanese sentences are inserted after each morpheme because Japanese is an 
agglutinative language. This process is performed automatically according to this 
system’s learning method [6], without requiring any static linguistic knowledge. 

3 Outline 

Figure 3 shows an outline of this RCL system’s extraction of bilingual collocations 
from sentence pairs of source and target language sentences. First, a user 
inputs a sentence pair. In the feedback process, this RCL system evaluates extracted 
word translations and bilingual templates using the given sentence pairs. 
The user does not evaluate word translations and bilingual templates directly. 
In the learning process, word translations and bilingual templates are extracted 
automatically using two learning algorithms: RCL and GA-IL. In this study, this 
RCL system extracts English-Japanese collocations. 




Fig. 3. Process flow 



4 Process 

4.1 Feedback Process 

Our system extracts not only correct bilingual collocations, but also erroneous 
bilingual collocations. Therefore, this RCL system evaluates bilingual colloca- 
tions in the feedback process. In this paper, correct bilingual collocation denotes 
a situation in which source parts and target parts correspond to each other; er- 
roneous bilingual collocations are cases where source parts and target parts do 
not correspond to one another. In the feedback process, this RCL system first 
generates sentence pairs in which source language sentences have the same character 
strings as source language sentences of the given sentence pairs. It does 
so by combining bilingual templates with word translations. Consequently, this 
RCL system can generate sentence pairs in which the English sentences have the 
same character strings as the English sentences of given sentence pairs. 

Subsequently, this RCL system compares Japanese sentences of the generated 
sentence pairs with Japanese sentences of given sentence pairs. When Japanese 
sentences of generated sentence pairs have the same character strings as the 
Japanese sentences of given sentence pairs, word translations and bilingual 
templates used to generate sentence pairs are determined to be correct. In this 
case, this RCL system adds one point to the correct frequency of the used word 
translations and bilingual templates. On the other hand, word translations and 
bilingual templates used to generate sentence pairs are designated as erroneous 
when Japanese sentences of generated sentence pairs have different character 
strings from Japanese sentences of given sentence pairs. In that case, this RCL 
system adds one point to the error frequency of used word translations and 
bilingual templates. Using the correct frequency and error frequency, this RCL 
system calculates the Correct Rate (CR) for the word translations and bilingual 
templates that were used. Following is a definition of CR. This RCL system 
evaluates word translations and bilingual templates automatically using CR. 



$$\mathrm{CR}\,(\%) = \frac{\text{Correct frequency}}{\text{Correct frequency} + \text{Error frequency}} \times 100$$
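
The feedback process and the CR definition can be sketched in a few lines. This is an illustrative reading of the description above, not the authors' code: the `Collocation` class, the `feedback` function, the use of romanised strings in place of Japanese script, and the convention that CR is 0.0 before any feedback are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Collocation:
    """A word translation or a bilingual template with its feedback counters."""
    source: str
    target: str
    correct: int = 0
    errors: int = 0

    def correct_rate(self):
        """CR (%); returns 0.0 before any feedback (an assumed convention)."""
        total = self.correct + self.errors
        return 100.0 * self.correct / total if total else 0.0


def feedback(template, word, given_source, given_target, variable="@0"):
    """Score a template and a word translation against one given sentence pair."""
    generated_source = template.source.replace(variable, word.source)
    generated_target = template.target.replace(variable, word.target)
    if generated_source != given_source:
        # Only generated pairs that reproduce the given source sentence are judged.
        return
    matched = generated_target == given_target
    for item in (template, word):
        if matched:
            item.correct += 1
        else:
            item.errors += 1


if __name__ == "__main__":
    template = Collocation("What's your @0?", "Anata no @0 wa nan desu ka?")
    good = Collocation("job", "shigoto")
    bad = Collocation("job", "inu")
    given = ("What's your job?", "Anata no shigoto wa nan desu ka?")
    feedback(template, good, *given)
    feedback(template, bad, *given)
    print(template.correct_rate(), good.correct_rate(), bad.correct_rate())
```

In this sketch a single template accumulates evidence from every word translation it is combined with, which is why its CR can fall between those of the individual word translations.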

4.2 Learning Process 

Word translations and bilingual templates are extracted reciprocally by this 
RCL system. We first describe the extraction process of word translations using 
bilingual templates as in process 1 of Fig. 2. Details of this process are: 

(1) This RCL system selects sentence pairs that have the same parts as those 
parts that adjoin variables in bilingual templates. 

(2) This RCL system obtains word translations by extracting parts that adjoin 
common parts, which are the same parts as those in bilingual templates, 
from sentence pairs. This means that parts extracted from sentence pairs 
correspond to variables in bilingual templates. In the extraction process, 
there are three patterns from the view of the position of variables and their 
adjoining words in bilingual templates. 

Pattern 1: When common parts exist on both the right and left sides of 
variables in source or target parts of bilingual templates, this RCL sys- 
tem extracts parts between two common parts from source language 
sentences or target language sentences. 

Pattern 2: When common parts exist only on the right side of variables 
in source parts or target parts of bilingual templates, this RCL system 
extracts parts from words at the beginning of the sentence to words of 
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the left sides of common parts in source language sentences or target 
language sentences. 

Pattern 3: When common parts exist only on the left side of variables in 
source parts or target parts of bilingual templates, this RCL system 
extracts parts from words of the right sides of common parts to words 
at the end in source language sentences or target language sentences. 

(3) This RCL system yields CR that are identical to those bilingual templates 
used to extract word translations. 
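
The three patterns above can be sketched as follows. This is a simplified illustration, not the authors' implementation: it assumes whitespace-separated tokens (the paper works on character strings), a single variable written with the hypothetical marker `@0`, and templates that span the whole sentence. The romanized Japanese in the usage note is also hypothetical.

```python
from typing import Optional, Tuple

VAR = "@0"  # hypothetical marker for the variable in a bilingual template


def extract_part(template: str, sentence: str) -> Optional[str]:
    """Return the sentence part that corresponds to the template variable.

    Covers Pattern 1 (common parts on both sides), Pattern 2 (common part
    only on the right) and Pattern 3 (common part only on the left).
    """
    t, s = template.split(), sentence.split()
    if VAR not in t:
        return None
    i = t.index(VAR)
    left, right = t[:i], t[i + 1:]
    end = len(s) - len(right)
    # The common parts of the template must match the sentence edges.
    if s[:len(left)] != left or s[end:] != right:
        return None
    part = s[len(left):end]
    return " ".join(part) if part else None


def extract_word_translation(
    template: Tuple[str, str, float], pair: Tuple[str, str]
) -> Optional[Tuple[str, str, float]]:
    """Extract a word translation; it inherits the CR of the template used."""
    source_template, target_template, cr = template
    source_part = extract_part(source_template, pair[0])
    target_part = extract_part(target_template, pair[1])
    if source_part is None or target_part is None:
        return None
    return source_part, target_part, cr
```

With the hypothetical template `("I like @0", "watashi wa @0 ga suki desu", 80.0)` and the sentence pair `("I like tennis", "watashi wa tenisu ga suki desu")`, the source side falls under Pattern 3 and the target side under Pattern 1.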

In addition, we describe the acquisition process of bilingual templates using 
word translations as in process 2 of Fig. 2. Details of this process are: 

(1) This RCL system selects word translations in which source parts have iden- 
tical character strings to those parts in source language sentences of sentence 
pairs, and in which target parts have the same character strings as parts in 
the target language sentences of sentence pairs. 

(2) This RCL system acquires bilingual templates by replacing common parts, 
which are identical to word translations, with variables. 

(3) This RCL system yields CR that are identical to those word translations 
used to acquire bilingual templates. 
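
The acquisition steps above can be sketched in the same spirit. This is an illustrative sketch only: it assumes whitespace-separated tokens rather than raw character strings, the hypothetical variable marker `@0`, and replacement of the first matching occurrence.

```python
from typing import Optional, Tuple

VAR = "@0"  # hypothetical marker for the variable in a bilingual template


def replace_part(sentence: str, part: str) -> Optional[str]:
    """Replace the first occurrence of `part` in `sentence` with a variable."""
    s, p = sentence.split(), part.split()
    if not p:
        return None
    for i in range(len(s) - len(p) + 1):
        if s[i:i + len(p)] == p:
            return " ".join(s[:i] + [VAR] + s[i + len(p):])
    return None


def acquire_template(
    pair: Tuple[str, str], word_translation: Tuple[str, str, float]
) -> Optional[Tuple[str, str, float]]:
    """Acquire a bilingual template; it inherits the CR of the word translation."""
    source_part, target_part, cr = word_translation
    source_template = replace_part(pair[0], source_part)
    target_template = replace_part(pair[1], target_part)
    # Both sides of the word translation must occur in the sentence pair.
    if source_template is None or target_template is None:
        return None
    return source_template, target_template, cr
```

Applied to the hypothetical pair `("I like tennis", "watashi wa tenisu ga suki desu")` with the word translation `("tennis", "tenisu", 75.0)`, this yields a template with one variable on each side and the same CR.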

On the other hand, word translations or bilingual templates that are used as 
starting points in the extraction process of new ones are extracted using GA-IL. 
The reason for using GA-IL is that our system can extract bilingual collocations 
using only a learning algorithm with no static linguistic knowledge. In this study, 
our system uses both RCL and GA-IL. 

5 Experiments for Performance Evaluation 

5.1 Experimental Procedure 

To evaluate this RCL system, 2,856 English and Japanese sentence pairs were 
used as experimental data. These sentence pairs were taken from five textbooks 
for first and second grade junior high school students. The total number of 
characters of the 2,856 sentence pairs is 142,592. The average number of words 
in English sentences in the 2,856 sentence pairs is 6.0. All sentence pairs are 
processed by our system based on the outline described in Section 3 and based 
on the process described in Section 4. The dictionary is initially empty. 

5.2 Evaluation Standards 

We evaluated all extracted word translations that corresponded to nouns. Extracted 
word translations are ranked when several different target parts are obtained 
for the same source part. In that case, word translations are sorted so 
that those with the highest CR described in Section 4.1 are 
ranked at the top. The user then evaluates the three word translations 
ranked from No. 1 to No. 3, judging whether they include a word translation 
whose source part and target part correspond to each other. 
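
The ranking step can be sketched as below. This is a minimal sketch under stated assumptions: the tuple layout `(source, target, CR)` and the function name `top_ranked` are hypothetical, and ties are left in input order.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def top_ranked(
    translations: Iterable[Tuple[str, str, float]], k: int = 3
) -> Dict[str, List[str]]:
    """Group candidates by source part and keep the k targets with highest CR."""
    by_source = defaultdict(list)
    for source, target, cr in translations:
        by_source[source].append((target, cr))
    ranked = {}
    for source, candidates in by_source.items():
        # Sort candidate target parts by CR, highest first.
        candidates.sort(key=lambda c: c[1], reverse=True)
        ranked[source] = [target for target, _ in candidates[:k]]
    return ranked
```

A source part counts as correctly extracted in the evaluation if the right target part appears among the (at most) three candidates returned for it.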

5.3 Experimental Results 

Table 1. Extraction rate of this RCL system 

Extraction rate    Detail: nouns    Detail: compound nouns 
74.9% (347)        75.5% (330)      65.4% (17) 



There are 463 kinds of nouns and compound nouns in the evaluation data: 
437 varieties of nouns and 26 varieties of compound nouns. Table 1 shows the 
extraction rate of this RCL system in the evaluation data. In Table 1, values in 
parentheses indicate the number of correct word translations extracted by this 
RCL system. Moreover, in this paper, a system using only GA-IL is used for com- 
parison to this RCL system. It is difficult to make comparisons among methods 
[1-3] which extract bilingual collocation because they typically use various static 
linguistic knowledge. In the system using only GA-IL, the extraction rate for the 
word translations which corresponded to nouns and compound nouns was 58.7%. 
Therefore, using RCL, the extraction rate improved from 58.7% to 74.9%. The 
extraction rate of word translations for which the frequency of appearance is 
only one time in the parallel text corpus improved from 32.6% to 58.1% through 
use of RCL. Table 2 shows examples of the extracted correct word translations. 
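
The rates in Table 1 follow directly from the counts given in the text (437 nouns, 26 compound nouns, and the numbers of correct translations in parentheses). A short check, with the helper name `extraction_rate` being an assumption:

```python
def extraction_rate(correct: int, total: int) -> float:
    """Extraction rate in percent, rounded to one decimal place."""
    return round(100.0 * correct / total, 1)


nouns = extraction_rate(330, 437)               # nouns
compound_nouns = extraction_rate(17, 26)        # compound nouns
overall = extraction_rate(330 + 17, 437 + 26)   # all 463 kinds
```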



Table 2. Examples of extracted correct word translations 

English                Japanese 
museum                 [hakubutsukan] 
machine                [kikai] 
sumo                   [sumo] 
Statue of Liberty      [jiyu no megami] 
Alice in Wonderland    [fushigi no kuni no arisu] 
electric guitar        [ereki gita] 

Note: [sumo] means a Japanese traditional sport. 



5.4 Discussion 

We confirmed that this RCL system can extract word translations without re- 
quiring a high frequency of appearances of word translations. Figure 4 shows 
the change in extraction rates engendered by this RCL system and the system 
using only GA-IL for every 100 word translations that correspond to nouns and 
compound nouns in the 2,856 sentence pairs used as evaluation data. Figure 4 
shows 463 word translations that correspond to nouns and compound nouns. 
The translations are arranged by appearance sequence in 2,856 sentence pairs. 



[Figure 4 plot. Legend: extraction rate of this system using RCL. Horizontal axis: 
number of word translations that correspond to nouns and compound nouns in a 
parallel corpus (100, 200, 300, 400, 463).] 



Fig. 4. Change of extraction rates and the average frequency of appearance of extracted 
word translations in the parallel corpus 



In Fig. 4, the dotted line shows average frequencies of appearance of word 
translations for every 100 word translations that correspond to nouns and com- 
pound nouns in evaluation data. The average frequencies of appearance of word 
translations between Nos. 1 and 100 are high because such word translations 
appear in many other sentence pairs. The average frequency of appearance of 
word translations between Nos. 1 and 100 is 13.0. In general, the system extracts 
word translations easily when their frequency of appearance is high because the 
probability that the system can extract them is relatively high. Consequently, 
the extraction rate of word translations between Nos. 1 and 100 is higher than 
for those in other parts of Fig. 4. On the other hand, the average frequency of 
appearance of word translations between Nos. 401 and 463 is low because such 
word translations do not appear in any other sentence pairs. The average fre- 
quency of appearance of word translations between Nos. 401 and 463 is 2.4. In 
general, it is difficult for the system to extract word translations when their fre- 
quency of appearance is low because the probability that the system can extract 
them is relatively low. 

Figure 4 also depicts the extraction rate of the system using only GA-IL. This rate 
decreases rapidly as the frequency of appearance of word translations decreases. 
In contrast, the extraction rate of this RCL system is almost flat except between 
Nos. 1 and 100. In this RCL system, the decrement of the extraction rate is 
only nine points between Nos. 101 and 463. In the system using only GA-IL, 
the decrement of the extraction rate is 23 points between Nos. 101 and 463. 
These results imply that this RCL system can extract many word translations 
efficiently without requiring a high frequency of appearance of word translations. 

On the other hand, this RCL system also extracts erroneous word translations. 
The precision of extracted word translations was 47.3%, which is insufficient. 
However, in the feedback process described in Section 4.1, this RCL system can 
identify such word translations as erroneous: it judged 69.2% of the extracted 
erroneous word translations to be erroneous. Here, a word translation is judged 
erroneous when its CR is under 50.0%. 

6 Conclusion 

This paper proposed a new method for automatic extraction of bilingual collocations 
using Recursive Chain-link-type Learning (RCL). In this RCL system, 
various bilingual collocations are extracted efficiently using only character strings 
of previously-extracted bilingual collocations. Moreover, word translations and 
bilingual templates are extracted reciprocally, as with a linked chain. Therefore, 
this RCL system can extract many word translations efficiently from sentence 
pairs without requiring any static linguistic knowledge, even when the corpus 
contains words with a very low frequency of appearance. This study 
demonstrates that our method is very effective for extracting word translations 
and thereby building an MT bilingual dictionary. 

Future studies will undertake more evaluation experiments using practical 
data. We also intend to confirm that this RCL system can extract bilingual 
collocations from sentence pairs using other languages. We infer that this RCL 
system is a learning algorithm that is independent of a specific language. More- 
over, we will apply RCL to other natural language processing systems, e.g., a 
dialog system, to confirm RCL effectiveness. 


Abstract. Proper name recognition is a subtask of Named Entity Recognition in the 
Message Understanding Conference. For our corpus annotation, proper name 
recognition is a crucial task, since proper names appear in more than 
approximately 50% of the sentences of the electronic texts that we collected for 
this purpose. Our work focuses on composite proper names (names with 
coordinated constituents, names with several prepositional phrases, and names 
of songs, books, movies, etc.). We describe a method based on heterogeneous 
knowledge and simple resources, and the preliminary results obtained. 



1 Introduction 

Our research group is compiling a large corpus. Since we defined its size in 
tens of millions of words and its objective as unrestricted text analysis, the easiest and 
quickest way to obtain texts was to extract electronic texts from the Internet. We 
selected four Mexican newspapers that publish daily on the Web a high proportion 
of their printed edition. We found that almost 50% of the total unknown words 
were proper names. This percentage shows the relevance of proper name recognition 
and justifies a wider analysis. 

Proper names have been studied in the field of Information Extraction [15] for 
diverse uses. For example, [5] employed proper names for automatic newspaper 
article classification. Information Extraction requires robust handling of proper 
names for successful performance in diverse tasks, such as pattern filling with correct 
entities that perform semantic roles [11]. The research carried out in the Message 
Understanding Conference (MUC) structured the named entity task and distinguished 
three types: ENAMEX, TIMEX and NUMEX [4]. ENAMEX considers entities such 
as organizations (corporation names, government entities, and other types of 
organizations), persons (person names, last names), and localities (locality names 
defined politically or geographically: cities, provinces, countries, mountains, etc.). 

In this paper, we are concerned with ENAMEX entity recognition, but we focused 
our work on composite named entities: names with coordinated constituents, names 
with several prepositional phrases, and names of songs, books, movies, etc. We 
postponed name classification to future work. 

NER work in MUC has been dedicated to the English language and has relied on 
complex tools or huge resources. For example, in [12] three modules were used for 
name recognition: List Lookup (consulting lists of likely names and name cues), Part 
of speech tagger, Name parsing (using a collection of specialized named entity 
grammars), and Name-matching (the names identified in the text are compared 
against all unidentified sequences of proper nouns produced by the part of speech 
tagger). The system of [14] recognized named entities by matching the input against 
pre-stored lists of named entities; other systems use gazetteers (lists of names, 
organizations, locations and other named entities) of very different sizes, from 110,000 
names (MUC-7) to 25,000-9,000 names [6]. 

The Language-Independent Named Entity Recognition shared task of the Conference 
on Computational Natural Language Learning (CoNLL) covered Spanish in 2002 
[13]. A wide variety of machine learning techniques were used and good results were 
obtained for named entity classification. However, composite names were limited: 
named entities are non-recursive and non-overlapping; when a named entity is 
embedded in another named entity, only the top-level entity is marked; and only one 
coordinated name appears in the training file. 

Since named entity recognition is a difficult task, our method is heterogeneous; it 
is based on local context, linguistic restrictions, statistical heuristics and the use of 
lists for disambiguation (very small external lists of proper names, one of similes and 
lists of non-ambiguous entities taken from the corpus itself). In this article, we present 
the text analysis carried out to determine the occurrence of named entities, then we 
detail our method and finally we present the results obtained. 



2 Named Entities in Newspaper Texts 

Two aspects should be considered in named entity recognition: recognizing known 
names and discovering new names. Newspaper texts contain a great 
quantity of named entities, and most of them are unknown names. Since named entities 
belong to an open class of words (entities such as commercial companies are created 
daily), unknown names become important when the entities they refer to 
become topical or fashionable. 

We analyzed Mexican newspaper texts that were compiled from the Web. They 
correspond to four different Mexican newspapers, between 1998 and 2001. From the 
analysis, we concluded that almost 50% of total words were unknown words*. We 
found 168,333 different words that were candidates to be named entities since they 
began with or were written entirely in capital letters. These capitalized words 
represent a low percentage of all different words but they appear in at least 50% of 
the sentences. We present such statistics in Table 1. From those numbers we note the 
importance of named entities for syntactic analysis of unrestricted texts, since 50% to 
60% of total sentences include named entities. 



* They were not recognized by our resources: a dictionary with POS and a spelling checker. 

Table 1. Statistics of newspaper texts 

Newspapers                       #1           #2           #3          #4 
# Words                          87,597,168   38,387,767   5,652,358   45,702,200 
# Sentences                      2,927,723    1,328,157    208,298     1,696,358 
# Sentences w/ named entities    1,581,225    729,496      100,602     1,007,051 
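
The share of sentences containing named entities can be recomputed from the counts in Table 1. The sketch below is only a worked check; the dictionary layout and the function name are assumptions. With these counts the share comes out between roughly 48% and 59% per newspaper.

```python
# (number of sentences, number of sentences with named entities), from Table 1
NEWSPAPERS = {
    "#1": (2_927_723, 1_581_225),
    "#2": (1_328_157, 729_496),
    "#3": (208_298, 100_602),
    "#4": (1_696_358, 1_007_051),
}


def share_with_named_entities(sentences: int, with_entities: int) -> float:
    """Percentage of sentences that contain at least one named entity."""
    return 100.0 * with_entities / sentences


shares = {
    name: round(share_with_named_entities(*counts), 1)
    for name, counts in NEWSPAPERS.items()
}
```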



The initial step for recognition of named entities was identification of context and 
style. We selected one Mexican newspaper, since we supposed that all newspapers 
present named entities in a similar manner. We analyzed newspaper #2 and we 
found that named entities are introduced or defined by means of syntactic-semantic 
characteristics and local context. The main characteristics observed were: 

Conventions. Specific words could introduce names, for example: coordinadora 
del programa Mundo Maya (Mundo Maya program’s coordinator), subsecretario de 
Operacion Energetica de la Secretaria de Energia (sub secretary of ...), etc. 

Redundancy. Information obtained from juxtaposition of named entities and 
acronyms, for example: Asociacion Rural de Interes Colectivo (ARIC), or from two names 
linked for the same entity by means of specific words: alias, (a), for example: ... 
dinero de Amado Carrillo Fuentes alias El Señor de los Cielos... 

Prepositions usage. We consider two cases: 

1. Prepositions link two different named entities. For example “a” indicates direction 
(Salina Cruz a Juchitán); “en” indicates a specific location (Tratado sobre Armas 
Convencionales en Europa), etc. 

2. Prepositions are included in the named entities (Instituto para la Proteccion al 
Ahorro Bancario, Monumento a la Independencia, Centro de Investigaciones y 
Estudios Superiores en Antropologia Social). 

Local context. Named entities are surrounded by local context signs. They could 
be used for identification of book, song and movie names. For example: Marx 
escribio La ideologia alemana (Marx wrote ...): ... titulado La celebracion de muertos 
en Mexico ( titled ...), etc. Some verbs (read, write, sing, etc.), some nouns (book, song, 
thesis, etc.), or proper names of authors often introduce or delimit such kind of names 
as those underlined. 

Sets of names. Named entities could appear as sets of capitalized words. 
Punctuation (,;) is used to separate them, for example: Bolivia, Brasil, Uruguay, 
Ecuador, Panama, or ... de Charles de Gaulle y Gagarin; de Juan Pablo II; de Eva 
Perón... 

Flexibility. Long named entities do not appear as fixed forms. For example: Instituto 
para la Proteccion al Ahorro, Instituto para la Proteccion al Ahorro Bancario, 
Instituto para la Proteccion del Ahorro Bancario, all of them correspond to the same 
entity. More variety exists for those names translated from foreign languages to 
Spanish. 

Coordinated names. Some named entities include conjunctions (y, e, etc.). For 
example: Luz y Fuerza del Centro, Margarita Dieguez y Armas, Ley de Armas de 
Fuego y Explosivos, Instituto Nacional de Antropologia e Historia. 

Concept names. Some capitalized words represent abstract entities. In a strict 
sense they could not be considered as named entities and they should be tagged with a 
different semantic tag. For example: Las violaciones a la Ley en que algunos ... (The 
violations to the Law in which some ...). Such entities should be differentiated 
from those names representing an abbreviation of longer names (for example: Ley del 
Seguro Social) in a deep understanding level. 



3 Named Entities Analysis 

In order to analyze how named entities could be recognized by means of linguistics 
and context rules or heuristics we separate newspaper #2 sentences in two groups: 

1. sentences with only one initial capitalized word, and 

2. sentences with more than one capitalized word. 

Group 1 may or may not contain named entities, since the first word in each sentence 
could be a named entity or a common word. [8] proposed an approach to disambiguate 
capitalized words when they appear in positions where capitalization is expected. 
Their method utilizes information from entire documents to dynamically infer the 
disambiguation clues. Since we have a big collection of texts we could apply the same 
idea. 

We concentrated our analysis on group 2. We built a Perl program that extracts 
groups of words that we call “compounds”; they are really the contexts where named 
entities could appear. The compounds contain no more than three non-capitalized 
words between capitalized words. We supposed that they should correspond to 
functional words (prepositions, articles, conjunctions, etc.) in composite named 
entities (coordinated names and names with several prepositional phrases). The 
compounds are limited on the left and right by a punctuation mark or a word, if one exists. 
For example, for the following sentence: 

Un informe oficial aseguro que Cuba invierte anualmente cerca de 100 millones de 
dolares en tecnologias informáticas y que en el trabajo para enfrentar al error del 
milenio, el país participo intensamente en el Grupo Regional de Mexico, 
Centroamerica y el Caribe, con apoyo del Centro de Cooperacion Internacional Y2K 
que funciona en Washington y que fue creado por la Organizacion de Naciones 
Unidas. 

We obtained the following compounds: 

• que Cuba invierte 

• el Grupo Regional de Mexico, Centroamerica y el Caribe, 

• del Centro de Cooperacion Internacional Y2K que funciona en Washington y 

• la Organizacion de Naciones Unidas. 
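
The compound extraction described above can be sketched as follows. The original program was written in Perl and is not shown in the paper, so this Python version is only an illustration under stated assumptions: the tokenizer, all function names, and the rule that a punctuation mark closes a compound unless a capitalized word follows it immediately are hypothetical choices made so that the sketch reproduces the four compounds listed for the example sentence.

```python
import re
from typing import List, Optional

TOKEN_RE = re.compile(r"\w+|[^\w\s]")  # words, or single punctuation marks


def is_punct(token: str) -> bool:
    return not token[0].isalnum()


def is_capitalized(token: str) -> bool:
    return token[0].isupper()


def next_capitalized(tokens: List[str], end: int, max_gap: int) -> Optional[int]:
    """Index of the next capitalized token that still belongs to the compound."""
    k = end + 1
    if k < len(tokens) and is_punct(tokens[k]):
        # Assumption: punctuation continues a compound only when a
        # capitalized word follows it directly (enumerations).
        k += 1
        return k if k < len(tokens) and is_capitalized(tokens[k]) else None
    for _ in range(max_gap + 1):
        if k >= len(tokens) or is_punct(tokens[k]):
            return None
        if is_capitalized(tokens[k]):
            return k
        k += 1  # skip one non-capitalized word (at most max_gap of them)
    return None


def detokenize(tokens: List[str]) -> str:
    text = ""
    for token in tokens:
        if text and not is_punct(token):
            text += " "
        text += token
    return text


def extract_compounds(sentence: str, max_gap: int = 3) -> List[str]:
    tokens = TOKEN_RE.findall(sentence)
    compounds = []
    i = 1  # the sentence-initial word is skipped
    while i < len(tokens):
        if not is_capitalized(tokens[i]):
            i += 1
            continue
        end = i
        nxt = next_capitalized(tokens, end, max_gap)
        while nxt is not None:
            end = nxt
            nxt = next_capitalized(tokens, end, max_gap)
        # One token of left context and one of right context, if it exists.
        right = min(end + 1, len(tokens) - 1)
        compounds.append(detokenize(tokens[i - 1:right + 1]))
        i = end + 1
    return compounds
```

Calling `extract_compounds` on the example sentence returns the four compounds listed above, starting with `que Cuba invierte`.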

From 723,589 sentences of newspaper #2, 1,348,387 compounds were obtained. 
We randomly analyzed approximately 500 sentences and encountered the main 
problems that our method should cope with. They are described in the following 
sections. 

3.1 Paragraph Splitting 

We observed two problems in paragraph splitting: 1) sentences that should be 
separated and 2) sentences wrongly separated. The causes of such errors were: 

1. Punctuation marks. Sentences ending with quotation marks and leaders. For ex.: 

“ pelicula personal. ” A pesar de 

It is a competence error, since in Spanish the period appears before the quotation 
marks when the whole sentence is meant to be marked. 

2. Abbreviations. For ex., in the following phrase “Arq.” corresponds to “architect”: 

ante las cámaras de television, el Arq. Hector E. Herrera Leon 

[9] consider several methods for determining English abbreviations in annotated 
corpus: combinations of guessing heuristics, lexical lookup and the document- 
centered approach. We only consider the first method to automatically obtain a list of 
abbreviations from newspaper #2. They were obtained with heuristics such as: 
abbreviations have length less than five characters, they appear after a capitalized 
word between commas, etc. They mainly correspond to professions and Mexican 
states. 

3. Style. Some sentences show an unclear style. For example, the use of parentheses: 
de nadie. (Y aunque muchos sabemos que los asaltos están tambien a la orden 

del dia, precisamente en el dia.) No hace mucho 

The traditional Spanish use of parentheses is the isolation of a small sentence part. 



3.2 Syntactic Ambiguity 

We found three main syntactic ambiguities in compounds, introduced by 
coordination, prepositional phrase attachment, and named entities composed of 
several words where only the first word is capitalized. 

The last one corresponds to titles of songs, movies, books, etc. For example: Ya en 
El perro andaluz, su primer filme ... (Already in The Andalusian dog, his first movie.) 
The titles appearing in the electronic texts begin with one capitalized word followed 
by several non-capitalized words, and sometimes another named entity is embedded. As 
far as we observed, no punctuation marks are used to delimit them. This use 
differs from that in the CoNLL-2002 training file, where the included titles 
are delimited by quotation marks. 

Recognition of named entities related to coordination and prepositional phrase 
attachment is crucial for our objective: unrestricted text analysis. For all singular 
conjunction cases, dependency grammars assign the following structure to 
coordinated structures in the surface level: (→) P1 → C → P2, where P1 is the 
subtree root. In the simpler and more usual case, the components P1 and P2 with the 
conjunction cover named entities, for example Luz y Fuerza. However, there are 
other cases where the coordinated pair is a sub-structure of the entire name, for 
example: Mesa de [Cultura y Derechos] Indigenas. 

The following compounds show the ambiguity introduced by coordination (the 
second component is given in parentheses): 

• Comisión Federal de Electricidad y Luz y Fuerza del Centro includes two 
organization names (second component: Luz y Fuerza del Centro). 

• Margarita Dieguez y Armas y Carlos Virgilio includes two personal names 
(second component: Carlos Virgilio). 

• Comunicaciones y Transportes y Hacienda includes two organization names 
(second component: Hacienda). 

Some compound examples of single names containing coordinated words: 

• Comisión Nacional Bancaria y de Valores 

• Subsecretario de Planeacion y Finanzas 

• Teatro y Danza de la UNAM 

Prepositional phrase attachment is a difficult task in syntactic analysis. Named 
entities present a similar problem. We adopt a different criterion from that considered 
in CoNLL: when a named entity is embedded in another named entity, or when a 
named entity is composed of several entities, all the components should be determined, 
since syntactic analysis should find their relations for deep understanding in higher 
levels of analysis. For example: 

1. Teatro y Danza de la UNAM (UNAM’s Theater and Dance) 

Teatro y Danza is a cultural department of a superior entity. 

2. Comandancia General del Ejercito Zapatista de Liberacion Nacional 
Comandancia General (General command) is the command of an entity (army). 

A specific grammar for named entities should cope with the already known 
prepositional phrase attachment problem. Therefore diverse knowledge described in 
the following section must be included to decide the splitting or joining of named 
entities with prepositional phrases. 



3.3 Discourse Structures 

Discourse structures could be another source for knowledge acquisition. Entities 
could be extracted from the analysis of particular sequences of texts. We are 
particularly interested in 

1. Enumeration that can be easily localized by the presence of similar entities, 
separated by connectors (commas, subordinating conjunction, etc). For example, in 
the following sequence: 

La Paz, Santa Cruz y Cochabamba 

Jose Arellano, Marco A. Meda, Ana Palencia Garcia y Viola Delgado 

2. Emphasizing words or phrases by means of quotation marks and capitalized words. 
For example: “Emilio Chichifet”, “Roberto Madrazo es el Cuello”, “Gusano 
Gracias”, are parodies of well known names and the sentence author denotes this by 
quotation marks. 

3. Author’s intention. A specific intention could be denoted by capitalized words, 
since the author chose the relation in the structure. For example, “Convent” in: 

una visita al antiguo Convento de la Encarnacion, ubicado en () 

y asi surgio el convento de Nuestra Señora de Balvanera. () 

The first one shows the author’s intention to denote the whole name of the building, 
covering its old purpose. The author’s intention in the second one is to make evident 
to whom the convent is dedicated. 



4 Method 

We conclude from our analysis that a method to identify named entities in our electronic 
text collection should be based mainly on the typical structure of Spanish named 
entities themselves, on their syntactic-semantic context, on discourse factors and on 
knowledge of specific composite named entities. Thus, our method consists of 
heterogeneous knowledge contributions. 

Local context. Local context has been considered in different tasks. For example, [7] 
used it for semantic attribute identification in new names. We consider local context 
to identify names of songs, books, movies, etc. For this purpose, a window of two 
words preceding the capitalized word was defined. In this window, a word appearing 
in a manually compiled list of 26 items plus synonyms and variants (feminine, 
masculine, plural) was considered a cue that introduces the name of a song, book, etc. For 
example: 

• En su libro La razon de mi vida (Editorial Pax)... (In his book The reason of my 
life (Pax publisher)) 

• ...comence a releer La edad de la discrecion de Simone de Beauvoir,... (I began to 
reread Simone de Beauvoir’s The age of discretion,) 

• ...en su programa Una ciudad para todos que... (in his program A city for all that) 

Some heuristics were determined to obtain the complete name: all subsequent words 
are linked until a specific word or punctuation sign is found. Such a word or 
punctuation sign could be: 1) a named entity, 2) any punctuation sign in texts 
(period, comma, semicolon, etc.) and 3) a conjunction. 

In the above examples the signs: “(“, “Simone de Beauvoir” and the conjunction 
“que” delimit the names. For more complex cases, statistics are included. 
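
The cue window and the delimiting heuristics can be sketched as below. This is an illustration only: the cue, conjunction and preposition sets are small hypothetical stand-ins (the paper's list of 26 cues is not reproduced), a capitalized word is used as a rough proxy for "a named entity", and trailing prepositions are stripped as an added assumption.

```python
import re
from typing import Optional

# Hypothetical stand-ins for the manually compiled resources.
CUES = {"libro", "programa", "releer", "leer", "cancion", "pelicula"}
CONJUNCTIONS = {"que", "y", "e", "o", "u"}
PREPOSITIONS = {"de", "del", "en", "a", "por", "para"}


def find_title(sentence: str, window: int = 2) -> Optional[str]:
    """Return the first title introduced by a cue word, or None."""
    tokens = re.findall(r"\w+|[^\w\s]", sentence)
    for i, token in enumerate(tokens):
        if i == 0 or not token[0].isupper():
            continue
        # A cue must appear within the two words preceding the capitalized word.
        preceding = [t.lower() for t in tokens[max(0, i - window):i]]
        if not any(word in CUES for word in preceding):
            continue
        title = [token]
        for nxt in tokens[i + 1:]:
            is_punct = not nxt[0].isalnum()
            # Stop at punctuation, a conjunction, or a capitalized word
            # (assumed here to start another named entity).
            if is_punct or nxt.lower() in CONJUNCTIONS or nxt[0].isupper():
                break
            title.append(nxt)
        # Assumption: drop a dangling preposition left before the delimiter.
        while len(title) > 1 and title[-1].lower() in PREPOSITIONS:
            title.pop()
        return " ".join(title)
    return None
```

On the three example sentences, the delimiters found are the opening parenthesis, the capitalized name Simone, and the conjunction que, respectively.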

The phrases delimited by quotation marks preceded by a cue were also considered 
as names of songs, books, movies, etc. 

Linguistic knowledge. We mainly consider the preposition use, part of speech of 
words linking groups of capitalized words, and punctuation rules. The linguistic 
knowledge is settled in linguistic restrictions. For example: 

1. Lists of groups of capitalized words are similar entities. Thus an unknown name 
has a similar category, and the last one should be a different entity coordinated by a 
conjunction. For example: Corea del Sur, Taiwan, Checoslovaquia y Sudáfrica. 

2. Preposition use, considering the meaning of prepositions for localization, direction, 
etc. For example: 

Preposition “por” followed by an undetermined article cannot link groups of person 
names. For example, the compound Juan Ramon de la Fuente por la Federación de 
Colegios de Personal Academico must be divided into Juan Ramon de la Fuente and 
Federación de Colegios de Personal Academico. By the same rule, the compound Alianza 
por la Ciudad de Mexico, which contains no person name, could correspond to a single name. 

Two named entities joined by preposition “a” should be separated if they are 
preceded by preposition indicating an origin position (“de”, “desde”). For example: de 
Oaxaca a Salina Cruz. 

Heuristics. Some heuristics were considered to separate compounds. 

1. Two capitalized words belonging to different lists must be separated. For 
example: “...en Chetumal Mario Rendon dijo ...”, where Chetumal is an item 
of the main cities list and Mario is an item of the personal names list. 

2. One personal name should not be coordinated in a single name entity. For 
example: Margarita Dieguez y Armas y Carlos Virgilio, where Carlos is an 
item of personal name list. 

3. A group of capitalized words with functional words followed by an acronym 
should be defined as a single name if most of the initial letters are in the acronym. 
For example: FIFA Federacion Internacional de la Asociacion de Futbol, 
OAA Administracion Americana para la Vejez. 

4. All capitalized words grouped by quotation marks, without punctuation 
marks, are considered one named entity. For example: "Adolfo Lopez Mateos". 
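
Heuristic 3 can be sketched as a simple letter-coverage test. This is a minimal sketch: the function name, the reading of "most" as "more than half", and the plain membership test (which ignores letter order and repetition) are assumptions.

```python
def acronym_supports_single_name(acronym: str, name: str,
                                 threshold: float = 0.5) -> bool:
    """True if most initials of the capitalized words occur in the acronym."""
    initials = [word[0].upper() for word in name.split() if word[0].isupper()]
    if not initials:
        return False
    letters = acronym.upper()
    hits = sum(1 for initial in initials if initial in letters)
    # Assumption: "most" is read as strictly more than half of the initials.
    return hits / len(initials) > threshold
```

For the second example, two of the three initials (A, A, V) occur in OAA, so the group would still be kept as a single name under this reading.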



Statistics. From newspaper #2 we obtained the statistics of groups of capitalized 
words, one single word to three contiguous words, and groups of capitalized words 
related to acronyms. The top statistics for such groups were used to disambiguate 
compounds joined by 

• Functional words. For example, the compound Estados Unidos sobre Mexico could 
be separated into Estados Unidos (a 2-word group with high score) and Mexico (a 
1-word group with high score). In the same manner the compound BP Amoco Plc is kept 
as is and ACNUR Kris Janowski is separated into ACNUR and Kris Janowski. 

• Prepositions. For example: Comandancia General del Ejercito Zapatista de 
Liberacion Nacional could be separated into: Comandancia General and Ejercito 
Zapatista de Liberacion Nacional. 

Many NER systems use lists of names, for example [6] made extensive use of 
name lists in their system. They found that reducing their size by more than 90% had 
little effect on performance, conversely adding just 42 entries led to improved results. 
[10] experimented with different types of lists in an NER system entered for MUC7. 
They concluded that small lists of carefully selected names are as effective as more 
complete lists. 

The lists of names used by named entity systems have not generally been derived 
directly from text but have been gathered from a variety of sources. For example, [2] 
used several name lists gathered from web sites containing lists of people’s first names, 
companies and locations. We also included lists from the Internet, and a hand-made list of 
similes [1] (stable coordinated pairs), for example: comentarios y sugerencias, noche y 
día, tarde o temprano (comments and suggestions, night and day, late or early). This 
list of similes was introduced to disambiguate coordinated groups of capitalized 
words. 

The lists obtained from Internet were: 1) a list of personal names (697 items), 2) a 
list of the main Mexican cities (910 items) considered in the list of telephone codes. 



Application of the method. Perl programs were built for the following steps, 
which delimit named entities: 

First step. All composite capitalized words joined by functional words are grouped 
into one compound. We use a dictionary with part-of-speech information to detect 
functional words. 

Second step. Using the previous resources (statistics of newspaper #2 and lists) and 
the rules and heuristics described above, the program decides whether to split, 
delimit, or leave each compound as is. The process is: 1) look up the compound in 
the acronym list, 2) decide on coordinated groups using the list of similes and rules 
(based on enumeration and statistics), 3) decide on prepositional phrases using rules, 
heuristics, and statistics, 4) delimit possible titles using context cues and rules, and 
5) decide on the remaining groups of capitalized words using heuristics and statistics. 

Table 2. Results on a test set of sentences 

            COORD. GROUPS   PREP. PHRASE GROUPS   TITLES   ALL 
Precision   56              70                    55       90 
Recall      49              67                    32       88 


5 Results 

We tested our method on 400 sentences of newspaper #4. They were manually 
annotated and compared against the results obtained with our method. The 
results are presented in Table 2, where: 

Precision: # of correct entities detected / # of entities detected 

Recall: # of correct entities detected / # of entities manually labeled (eml) 

The table indicates the performance for coordinated names (55 eml), prepositional 
groups^ (137 eml), and titles (19 eml). The last column shows the overall performance 
(1279 eml), including the previous ones. The main causes of errors are: 1) foreign 
words, 2) personal names missing from the available list, and 3) names of cities. 

The overall results obtained by [3] on Spanish texts for named entity recognition 
were 92.45% precision and 90.88% recall. However, their test file includes only one 
coordinated name, and when a named entity is embedded in another named entity, 
only the top-level entity was marked. In our work, the latter case was marked as 
incorrect. 

The worst result was that of title recognition, since 60% of the titles were not 
introduced by a cue. Better recognition of titles and of named entities with 
coordinated words would require enlarging the current resources. 40% of the 
correctly detected coordinated entities were based on the list of similes, which could 
be manually enlarged. 



6 Conclusions 

In this work, we present a method to identify and disambiguate groups of capitalized 
words. We are interested in minimal use of complex tools. Therefore, our method 
uses extremely small lists and a dictionary with part-of-speech information. The use 
of limited resources yields robustness and speed of execution, which are important 
characteristics for processing huge quantities of text. 

Our work is focused on composite named entities (names with coordinated 
constituents, names with several prepositional phrases, and names of songs, books, 
movies, etc.). The strategy of our method is the use of heterogeneous knowledge to 
decide on splitting or joining groups of capitalized words. We confirmed that 
conventions are very similar in different newspapers, so the heuristics are applicable 
to the four newspapers selected. 

The results were obtained from 400 sentences that correspond to different topics. 
These preliminary results show the possibilities of the method and the information 
required for better results. 

^ Prepositional phrases related to acronyms were not considered in these results. 



Abstract. The paper presents a method of automatic enrichment of a very large 
dictionary of word combinations. The method is based on the results of automatic 
syntactic analysis (parsing) of sentences. The dependency formalism is used to 
represent syntactic trees, which allows for easier treatment of information 
about syntactic compatibility. An evaluation of the method is presented for the 
Spanish language, based on a comparison of the automatically generated results 
with manually marked word combinations. 

Keywords: Collocations, parsing, dependency grammar, Spanish. 



1 Introduction 

There is a growing demand for linguistic resources in modern linguistics and 
especially in natural language processing. One of the important types of resources is 
a dictionary that reflects how words combine with each other. 

The question of which types of information about the compatibility of words 
should be stored in the dictionary has a rather long history; the first papers appeared 
in the 1950s. Basically, researchers focused on the concept of collocation and its 
usage. The main research direction was the integration of this concept into 
lexicographical practice and into methods of teaching foreign languages: how many 
examples dictionaries or textbooks should contain, and whether the examples of 
collocations are just examples of usage or an essential part of the knowledge of a 
language (language competence). 

After many discussions, the common point is that it is very difficult to find a 
concise and formal definition of collocation. Nevertheless, it seems that the majority 
of investigators agree that collocations are a very important part of the knowledge of 
a language and that they are useful for different tasks of automatic natural language 
processing, such as automatic translation, text generation, intelligent information 
retrieval, etc. All this creates the need to compile specialized dictionaries of 
collocations and even of free word combinations. See the next section for a more 
detailed discussion of the concept of collocation. 

There exist many methods of extracting collocations that are based on the analysis 
of a large corpus [1, 3, 8, 10, 11, 18]. Still, the majority of them are oriented towards 
searching for highly repetitive combinations of words by measuring their mutual 
information. These methods do not guarantee finding collocations that do not have a 
sufficiently high frequency, and usually a great number of collocations do not have 
this frequency. Besides, the corpus size for such a search should be larger than that 
of the existing corpora (now measured in gigabytes). 

There have been several attempts to apply the results of automatic syntactic analysis 
(parsing) to the compilation of dictionaries of collocations [3, 7]. For example, in a 
recent work Strzalkowski [17] uses syntactic analysis to improve the results of 
information retrieval by enriching the query. One of the classic works on the theme is 
[15], which presents the system Xtract for finding repeated co-occurrences of words 
based on their mutual information. The work consists of three stages; at the third 
stage, partial syntactic analysis is used to filter out the pairs that do not have a 
syntactic relation. Unfortunately, all these methods are applied to word pairs 
obtained by frequency analysis using a threshold. The aim of all these methods is 
collocations, not free word combinations (see below). For the Xtract system, rather 
high precision (80%) and recall (94%) are reported; nevertheless, the evaluation was 
done by comparing the results with the opinion of only one lexicographer, and what 
counts as a collocation according to the system remains unclear. Evidently, free word 
combinations were not processed. 

Some resources of the described type are already available. One of the 
largest dictionaries of collocations and free word combinations is the CrossLexica 
system [4, 5, 6]. It contains about 750,000 word combinations for Russian, with 
semantic relations between the words and the possibility of inference. There are also 
resources of this type for English, e.g., the Oxford dictionary of collocations [14] 
(170,000 word combinations) or the Collins dictionary [2] (140,000 word 
combinations), though they do not contain semantic relations. This is the lower bound 
on the size of a dictionary of word combinations, which justifies the term very large 
dictionary in the title of this paper. 

In the rest of the paper, we first discuss the concept of collocation and its relation 
with free word combination, and then we describe the method of enrichment of the 
dictionary based on automatic syntactic analysis with dependency representation. 
After this, we evaluate its performance, and finally draw some conclusions. 



2 Idioms, Collocations, and Free Word Combinations 

Now let us discuss the concept of collocation in more detail. Intuitively, a collocation 
is a combination of words that has a certain tendency to be used together. Still, the 
strength of this tendency differs for different combinations. Thus, collocations 
can be thought of as a scale with different grades of strength of the inter-word 
relation, from idioms to free word combinations. 

On the one side of the scale there are complete idioms like “to kick the bucket”, 
where neither the word “to kick” nor “the bucket” can be replaced without destroying 
the meaning of the word combination. In this case, the meaning of the whole is not 
related to the meaning of the components. In certain much rarer cases, the 
meaning of the whole is related to the meaning of the components but also has an 
additional part that cannot be inferred, e.g., “to give the breast”, which means “to 
feed a baby using the breast”. Though the physical situation is described correctly by 
the words “to give” and “the breast”, the meaning of “feeding” is obtained from 
general knowledge about the world. This case can be considered a small shift on 
the scale towards free word combinations; nevertheless, combinations of this type are 
still idioms (“nearly-idioms” according to Mel’čuk [13]). 

On the other side of the scale there are free combinations of words, like “to see a 
book”, where any word of the pair can be substituted by a rather large class of words 
and the meaning of the whole is the sum of the meanings of the constituent words. 

Somewhere in the middle of this scale, there are lexical functions* [13] like “to pay 
attention”. In this case, the meaning of the whole is directly related to only one 
word (in the example above, the word attention), while the other word expresses a 
certain standard semantic relation between the actants of the situation. The same 
relation is found, for example, in combinations like “to be on strike”, “to let out a 
cry”, etc. Usually, for a given semantic relation and a given word that should 
preserve its meaning, there is a unique way to choose the word expressing the 
relation in a given language. For example, in English it is to pay attention, while in 
Spanish it is prestar atención (lit. to lend attention), in Russian obratit’ vnimanije 
(lit. to turn attention to), etc. 

As far as free word combinations are concerned, it can be seen that some free word 
combinations are “less free” than others, though they are still free word 
combinations in the sense that the meaning of the whole is the sum of the meanings 
of the constituent words. The degree of freedom depends on how many words can be 
used as substitutes for each word. The smaller the number of substitutes, the more 
“idiomatic” the word combination is, though such combinations never become 
idioms or lexical functions, where the meanings cannot be summed. 

It is obvious that the restrictions on freedom in free word combinations are 
semantic constraints. For example, “to see a book” is less idiomatic than “to read a 
book” because there are many more words that can substitute for a book in 
combination with the verb to see than with the verb to read. Namely, practically any 
physical object can be seen, while only objects that contain some written information 
(or its metaphoric extension, like “to read signs of anger in his face”) can be read. 

Another important point is that some free word combinations can have associative 
relations between their members, e.g., a rabbit can hop, and so can a flea, but a wolf 
usually does not hop, though potentially it can move in this manner. This makes some 
combinations more idiomatic because the inter-word relation is strengthened by 
association. 



In a strict sense, only lexical functions are collocations, but the common treatment 
of this concept also extends it to free word combinations that are “more idiomatic”. 
Since there is no obvious border between more idiomatic and less idiomatic, the 
concept of collocation can ultimately cover all free word combinations as well, though 
this makes the concept useless, because its purpose is to distinguish idiomatic word 
combinations from free ones. Thus, in our opinion, the difficulties with the concept 
of collocation stem from the impossibility of drawing an exact border between it and 
free word combinations. 

Note that the obvious solution, to treat only lexical functions as collocations, 
contradicts common practice. This demonstrates that, in any case, we need 
something to distinguish between more and less idiomatic free word combinations. If 
it is not collocation, then another term should be invented. 

* Lexical functions were discovered by Mel’čuk in the 1970s. Unfortunately, they are 
still not reflected systematically even in good dictionaries, because finding these 
functions is rather laborious and requires very high-level lexicographic competence. 



3 Automatic Enrichment of the Dictionary 

Traditionally, free word combinations are considered of no interest to linguistics, 
though, in fact, practically any free word combination is “idiomatic” to a certain 
degree, because the majority of them have certain semantic restrictions on 
compatibility. In our opinion, they are of interest because, according to Firth's 
famous idea “you shall know a word by the company it keeps”, any word combination 
is important. For example, in automatic translation, some wrong hypotheses can be 
eliminated using the context [16]; in language learning, knowing the compatibility of 
a word allows for much better comprehension of it; not to mention automatic word 
sense disambiguation, where one of the leading approaches is the analysis of the 
context in search of compatible words, etc. 

Note that manual compilation or enrichment of a dictionary of free word 
combinations is very time-consuming; for example, CrossLexica [5] has been 
compiled over more than 13 years and is still very far from complete. 

We suggest the following method of automatic enrichment of dictionaries of this 
kind. Obviously, the method needs some post-verification, because we cannot 
guarantee the total correctness of the automatic syntactic analysis; still, it is much 
more efficient than doing the work manually. 

We work with the Spanish language, but the method is easily applicable to any 
other language, provided a grammar and a parser are available. First, we 
apply automatic syntactic analysis using the parser and the grammar of Spanish 
developed in our laboratory [9]. The results of the syntactic analysis are represented 
using the formalism of dependencies [12]. It is well known that the expressive power 
of this formalism is equal to that of the formalism of constituents, which is much 
more commonly used, but word combinations are much easier to process using 
dependencies. 

The idea of the formalism of dependencies is that any word has dependency 
relations with other words in a sentence. The relations are associated directly with 
word pairs, so it is not necessary to traverse the constituency tree in order to obtain 
the relation. One word is always the head of the relation, and the other one is its 
dependant. Obviously, one headword can have several dependants. 

The problems that remain to be solved even with this formalism are the treatment of 
coordinating conjunctions and prepositions, and the filtering of some types of 
relations and some types of nodes (pronouns, articles, etc.). 

We store the obtained combinations in a database. All members of the pairs are 
normalized. Still, some information about the form of the dependant is also saved. In 
our case, for nouns we save the information about number (singular or plural), 
say, “play game Sg” and “play game Pl” (the word combination in both cases is “play 
game”, with an additional mark); for verbs, the information on whether the form is a 
gerund or a participle is important, etc. 

Coordinating conjunctions are heads in the coordinative relation; still, the word 
combinations that should be added to the dictionary are the combinations with their 
dependants. For example, for I read a book and a letter, the combinations that should 
be extracted are read book and read letter. Thus, the algorithm detects this situation 
and generates two virtual combinations that are added to the dictionary. 

The treatment of the prepositional relation differs from that of other relations. 
Since prepositions usually express grammatical relations between words (in other 
languages, for example, these relations can be expressed by grammatical cases), the 
important relation is not the relation with the preposition but the relation between 
the two lexical units connected by the preposition. Still, the preposition itself is also 
of linguistic interest, so we reflect this relation in the dictionary by a word 
combination that contains three members: the headword of the preposition, the 
preposition, and its dependant; e.g., He plays with a child gives the combination play 
with child. 

Filtering of certain types of nodes is very easy. Since the parser uses 
automatic morphological analysis, the morphological information for every word is 
available. It allows for filtering out the combinations without significant lexical 
content, i.e., those in which at least one word belongs to one of the categories with 
mainly grammatical meaning. The following categories are discarded in the current 
version of the algorithm: pronouns (personal, demonstrative, etc.), articles, 
subordinating conjunctions, negation (not), and numerals. Since the combinations 
with these words have no lexical meaning, they have no semantic restrictions on 
compatibility and can be considered “absolutely” free word combinations. These 
combinations are of no interest for the dictionary under consideration. 

The other filter is for the types of relations. It depends on the grammar that is used. 
In our grammar, the following relations are present: dobj (direct object), subj 
(subject), obj (indirect object), det (determinative), adver (adverbial), cir 
(circumstantial), prep (prepositional), mod (modifying), subord (subordinate), coord 
(coordinative). Among these relations, the prepositional and coordinative ones are 
treated in a special way, as mentioned above. The only relations left that are of no 
use for detecting word combinations are the subordinate relation and the 
circumstantial relation. 
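
The special treatment of coordination and prepositions, together with the two filters, can be sketched as below. This is a minimal illustration under stated assumptions rather than the authors' implementation: the `Node` class, the part-of-speech tag names, and the output tuple format are hypothetical, and only the relation names subord and cir are taken from the text.

```python
from dataclasses import dataclass, field

# Hypothetical tag names for the discarded categories listed in the text.
FILTERED_POS = {"PRON", "ART", "DET", "CONJ_S", "NEG", "NUM"}
# Relations named in the text as useless for word combinations.
FILTERED_RELATIONS = {"subord", "cir"}


@dataclass
class Node:
    lemma: str
    pos: str
    relation: str = ""                      # relation of this node to its head
    children: list = field(default_factory=list)


def _flatten(node):
    """Replace a coordinating conjunction by its conjuncts (recursively)."""
    if node.pos == "CONJ_C":
        return [c for child in node.children for c in _flatten(child)]
    return [node]


def _add(out, head, link, dep):
    """Store a combination unless one member has mainly grammatical meaning."""
    if head.pos not in FILTERED_POS and dep.pos not in FILTERED_POS:
        out.append((head.lemma, link, dep.lemma))


def extract(head, out=None):
    """Collect word combinations from the subtree rooted at `head`."""
    out = [] if out is None else out
    for child in head.children:
        for dep in _flatten(child):
            if dep.pos == "PREP":
                # Three-member combination: headword, preposition, dependant.
                for obj in dep.children:
                    for target in _flatten(obj):
                        _add(out, head, dep.lemma, target)
                        extract(target, out)
            else:
                # The relation of the conjunction is propagated to its conjuncts.
                if child.relation not in FILTERED_RELATIONS:
                    _add(out, head, child.relation, dep)
                extract(dep, out)
    return out


# "I read a book and a letter"
read = Node("read", "V", children=[
    Node("I", "PRON", "subj"),
    Node("and", "CONJ_C", "dobj", [
        Node("book", "N", "coord_conj", [Node("a", "ART", "det")]),
        Node("letter", "N", "coord_conj", [Node("a", "ART", "det")]),
    ]),
])
print(extract(read))

# "He plays with a child"
play = Node("play", "V", children=[
    Node("he", "PRON", "subj"),
    Node("with", "PREP", "prep", [Node("child", "N", "prep", [Node("a", "ART", "det")])]),
])
print(extract(play))
```

On the two example sentences from the text, the sketch yields read book and read letter for the first and the three-member combination play with child for the second; the pronouns and articles are dropped by the category filter.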

One of the advantages of the suggested method is that it does not need a corpus for 
its functioning, and thus it does not depend on the corpus size or on the lexical 
structure of the corpus. 

Let us have a look at an example of the functioning of the method. The following 
sentence is automatically parsed. 

Conocía todos los recovecos del río y sus misterios. 

(I knew all detours of the river and its mysteries.) 

The following dependency tree corresponds to this sentence. The depth of a node in 
the tree (the number of spaces at the beginning of each line^) corresponds to the 
relations. For example, V(SG,1PRS,MEAN) [conocía] is the head of the sentence and 
its dependants are CONJ_C [y] and $PERIOD. CONJ_C has the dependants 
N(PL,MASC) [recovecos] and N(PL,MASC) [misterios], etc. Each line corresponds 
to a word and contains the word form and its lemma, e.g., conocía : conocer (knew : 
know), etc. 

V(SG,1PRS,MEAN) -> () // Conocía : conocer (knew : know) 
  CONJ_C -> (obj) // y : y (and : and) 
    N(PL,MASC) -> () // recovecos : recoveco (detours : detour) 
      PR -> (prep) // del : del (of the : of the) 
        N(SG,MASC) -> (prep) // río : río (river : river) 
      ART(PL,MASC) -> (det) // los : el (the : the) 
      #*$$todo# -> () <*$$todo> // todos : todo (all : all) 
    N(PL,MASC) -> (coord_conj) // misterios : misterio (mysteries : mystery) 
      DET(PL,MASC) -> (det) // sus : su (its : it) 
  $PERIOD -> () // . : . 

The following word combinations were detected: 

conocer (obj) recoveco (to know detour) 
conocer (obj) misterio (to know mystery) 
recoveco (prep) [del] río (detour of the river) 

It can be seen that the relation (obj) is attached to the coordinating conjunction and 
is then propagated to its dependants: recoveco (detour) and misterio (mystery). The 
preposition del is part of the 3-member word combination. The combinations with 
articles and pronouns (el, todo, su) are filtered out, though the algorithm found 
them. 
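
Since the tree is encoded by the number of leading spaces, the head of each line can be recovered with a simple stack. The sketch below illustrates this under assumptions of our own: two spaces per level and lines reduced to bare lemmas; the actual output format of the parser is richer.

```python
def parse_indented_tree(text, indent=2):
    """Return (head, dependant) pairs from an indentation-encoded tree.

    Assumes `indent` spaces per level of depth (an assumption of this sketch).
    """
    pairs = []
    stack = []  # stack[d] holds the most recent node seen at depth d
    for line in text.splitlines():
        if not line.strip():
            continue
        depth = (len(line) - len(line.lstrip(" "))) // indent
        label = line.strip()
        del stack[depth:]            # keep only the ancestors of this line
        if stack:
            pairs.append((stack[-1], label))
        stack.append(label)
    return pairs


sample = """conocer
  y
    recoveco
      del
        río
      el
      todo
    misterio
      su
  ."""
for head, dependant in parse_indented_tree(sample):
    print(head, "->", dependant)
```

Each printed pair is a raw head-dependant link; the treatment of conjunctions and prepositions and the filtering are applied to such links afterwards.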



4 Evaluation 

We conducted the experiments on a randomly chosen Spanish text from the 
Cervantes Digital Library. In total, 60 sentences containing 741 words were parsed, 
on average 12.4 words per sentence. For evaluation, we manually marked all 
dependency relations in the sentences. Then we compared the automatically added 
word combinations with the manually marked word combinations. 

In addition, we used as a baseline a method of gathering word combinations that 
takes all pairs of words that are immediate neighbors. We also added certain 
intelligence to this baseline method: it ignores articles and takes prepositions into 
account. In total, there are 153 articles and prepositions in the sentences, so the 
number of words for the baseline method is 741 - 153 = 588. 

The following results were obtained. The total number of correct manually marked 
word combinations is 208. Of these, 148 word combinations were found by our 
method. At the same time, the baseline method correctly found 111 word 
combinations. On the other hand, our method found only 63 incorrect word 
combinations, while the baseline method marked as word combinations 588*2 - 1 = 
1175 pairs, of which 1175 - 111 = 1064 are wrong pairs. 

^ Usually, arrows are used to show the dependencies between words, but it is 
inconvenient to work with arrows in text files, so we use this method of 
representation. Besides, this representation is much more similar to the constituency 
formalism. 

These numbers give us the following values of precision and recall. Recall 
that precision is the ratio of the correctly found combinations to all found 
combinations, while recall is the ratio of the correctly found combinations to the 
total that should have been found. For our method, precision is 148 / (148+63) = 0.70 
and recall is 148 / 208 = 0.71. For the baseline method, precision is 111 / 1175 = 0.09 
and recall is 111 / 208 = 0.53. Thus the precision of our method is much higher, and 
its recall is also higher, than those of the baseline method. 
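
The arithmetic behind these figures can be reproduced with a few lines of code. The function name and its parameters are our own; the counts are the ones reported above.

```python
def precision_recall(correct, incorrect, gold_total):
    """Precision = correct / all found; recall = correct / all that should be found."""
    found = correct + incorrect
    return correct / found, correct / gold_total


ours = precision_recall(correct=148, incorrect=63, gold_total=208)
baseline = precision_recall(correct=111, incorrect=1064, gold_total=208)

print(f"our method: precision={ours[0]:.2f} recall={ours[1]:.2f}")
print(f"baseline:   precision={baseline[0]:.2f} recall={baseline[1]:.2f}")
# our method: precision=0.70 recall=0.71
# baseline:   precision=0.09 recall=0.53
```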



5 Conclusions 

A dictionary of free word combinations is a very important linguistic resource. Still, 
compiling and enriching such a dictionary manually is too time- and effort-consuming 
a task. We proposed a method that allows for the semi-automatic enrichment of such 
a dictionary. The method is based on parsing using the dependency formalism and on 
the subsequent extraction of word combinations. Some types of relations and some 
types of nodes are filtered out because they do not carry substantial lexical 
information. Special processing of coordinative and prepositional relations is 
performed. The method requires post-processing of the obtained word combinations, 
but only to verify that no parser errors are present. 

The results were evaluated on a randomly chosen text in Spanish. The proposed 
method has much higher precision and better recall than the baseline method. 



Abstract. In a statistical machine translation system (SMTS), decoding 
is the process of finding the most likely translation based on a statistical 
model according to previously learned parameters. This paper 
proposes a new approach based on evolutionary hybrid algorithms to 
translate sentences in a specific technical context. The tests are carried 
out on sentences in Spanish that are translated into English. The 
experimental results validate the performance of our method. 



1 Introduction 

Machine Translation (MT) is the process of automatic translation from one natural 
language to another using a computer program. It is often argued that 
the problem of MT requires the problem of natural language understanding to 
be solved first. However, a number of empirical translation methods work 
surprisingly well [1]. Among these methods are statistical methods, 
which try to generate translations based on bilingual text corpora. Statistical 
machine translation (SMT) was first introduced by Brown et al. [11] in the 1990s. 
In order to design an SMT system that can translate a source sentence s (for example, 
Spanish) into a target sentence e (for example, English), the following components 
are required: 

A language model (LM) that assigns a probability P(e) to each English string. 

A translation model (TM) that assigns a probability P(s|e) to each pair of 
English and Spanish strings. 

A decoder that takes as input a new sentence s and tries to generate as output 
a translated sentence e that maximizes the translation probability P(e|s) or, 
equivalently according to Bayes' rule, maximizes P(e) · P(s|e). 

There exist efficient algorithms to estimate the probabilities for a language 
model, such as n-gram models [2]. Translation models are usually based on the word 
replacement models developed by IBM in the early 1990s [11]. These models 
are referred to as (IBM) Models 1-5. In this paper, we focus our attention on 
the decoder. A good decoding or search algorithm is critical to the success of 
any SMT system [8]. Knight [7] has shown that the decoding problem is 
NP-complete. Thus, because of the complexity of the decoding problem, optimal 
decoders, i.e., decoders that are guaranteed to find optimal solutions, are not used in 
practical SMT implementations. However, some approximate decoders have been 
proposed in the literature, such as stack or A* algorithms [4], dynamic-programming-based 
algorithms [15], and greedy heuristic algorithms [8,5]. In this paper, we 
introduce an evolutionary decoding algorithm in order to improve the efficiency 
of the translation task in the SMT framework. The translation is performed from 
Spanish to English sentences, in the context of the computer science technical 
area. The next section presents SMT, focusing on IBM Model 4, and a 
description of the problem. In Section 3 we introduce the Evolutionary Decoding 
Algorithm (EDA) that solves the specific decoding problem. Specialized operators 
and mechanisms for parameter control included in EDA are also described. 
Section 4 presents the tests and results for a set of translations from Spanish to 
English. Finally, we present our conclusions and further work. 



2 Statistical Machine Translation 

With a few exceptions [14], most SMT systems are based on the noisy channel 
framework (see Figure 1), which has been successfully applied to speech 
recognition. In the machine translation framework, the sentence e is written in a 
source language, for example English; it is then supposed to be transformed by 
a noisy probabilistic channel that generates the equivalent target sentence s, 
which in our case is in Spanish. Decoding is the process of taking as input any string 
s in the target language and finding the source e of highest probability that 
matches the target. The source language is the language into which the SMT system 
translates. The two key notions involved are those of the language model and 
the translation model. The language model provides us with probabilities P(e) for 
strings of words or sentences; these are estimated using a monolingual corpus. 
The source vocabulary is defined as the set of all different words in the source 
corpus, and analogously for the target vocabulary. 

$$\arg\max_e P(e \mid s) = \arg\max_e P(e)\, P(s \mid e)$$ 

Fig. 1. The noisy channel model 

Broadly speaking, the probabilities for each sentence of the source corpus are 
computed independently. The translation model provides the conditional 
probabilities. Estimating the conditional probability P(s|e) that a target sentence s 
occurs in a target text which translates a text containing the source sentence e 
requires a large bilingual aligned corpus. Usually the parameters of 
both the language and the translation models are estimated using traditional 
maximum likelihood and expectation maximization techniques [3]. Translation 
is the problem of finding the e that is most probable given s. 

Our work is based on the IBM Model 4 translation model. Before we 
describe this model, it is useful to introduce some further notions. In a word-aligned 
sentence pair, it is indicated which target words correspond to each source word, 
as shown in Figure 2. 

Target s:    Mary no dio una bofetada a la bruja verde 
             1 2 3 4 5 6 7 8 9 
Alignment a: [1 3 4 4 4 0 5 7 6] 

Fig. 2. An example of alignment for English and Spanish sentences 

For that, the following variables are required: 

$l$, the number of words of $e$; $m$, the number of words of $s$; 
$e_i$, the $i$-th word of $e$; $s_j$, the $j$-th word of $s$. 

The fertility of a source word is the number of corresponding 
words in the target string. In theory, an alignment can correspond to any set 
of connections. However, IBM's models are restricted to alignments in which each 
target word corresponds to at most one source word. It is possible to represent 
the alignment $a$ as a vector $(a_1, a_2, \ldots, a_m)$, where the value of $a_k$ 
indicates the position in the source sentence of the word that corresponds to the 
$k$-th word in the target sentence. When a target word is not connected to any 
source word, its value is equal to zero; this is illustrated in Figure 2 using a NULL 
symbol. 
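
As a small illustration of the alignment vector, the fertility of every source word can be read off by counting how often its position occurs in the vector. The sketch uses the sentence pair of Figure 2; the function name is our own.

```python
from collections import Counter

# Source sentence e with the NULL word at position 0, and target sentence s.
source = ["NULL", "Mary", "did", "not", "slap", "the", "green", "witch"]
target = "Mary no dio una bofetada a la bruja verde".split()

# alignment[k-1] is the source position aligned with the k-th target word.
alignment = [1, 3, 4, 4, 4, 0, 5, 7, 6]


def fertilities(alignment, source_length):
    """Number of target words aligned with each source position (0 = NULL)."""
    counts = Counter(alignment)
    return [counts[i] for i in range(source_length)]


for word, fertility in zip(source, fertilities(alignment, len(source))):
    print(f"{word}: {fertility}")
```

Here slap has fertility 3 (dio una bofetada), did has fertility 0, and the target word a is generated by NULL.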



[Figure: the Translation Model, comprising the Lexical Model, the Fertility Model, 
the Distortion Model (head and non-head words), and the NULL Translation Model] 

Fig. 3. Translation Model (IBM Model 4) 



2.1 IBM Model 4 

This model uses the following sub-models to compute the conditional probabil- 
ities P{a, s|e): 

Lexical Model t{sj\ei): Word-for-word translation model, represents the prob- 
ability of a word sj corresponding to the word e*, i.e. Ci is aligned with Sj. 
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Fertility Model n{4>i\ei): Represents the probability of a source word Cj to be 
aligned to 4>i words in the target sentence s. When a source word is not aligned 
to a target word cj)i it is equal to zero. 

Distortion Model d: This model captures the probability that the position of 
a word in the source language matches in the target language. For this the model 
uses a cluster technique to define word classes -4(ei) for the source language and 
word classes B{sj) for the target language. The words are clustered using some 
similarities criteria. The Distortion model is also broken down into two sets of 
parameters: 

Distortion probability for head words: A head word is the first word s_j aligned 
with e_i, with fertility φ_i not equal to zero. It is denoted by the following 
expression: d_1(j − c_{ρ_i} | A(e_{ρ_i}), B(s_j)), where j is the head word 
position, ρ_i is the position of the first fertile word to the left of e_i, and 
c_{ρ_i} is the representative position in s of the words aligned with e_{ρ_i}. If 
the fertility of word e_{ρ_i} is greater than one, its representative position is 
calculated as the ceiling of the average of the positions of all words aligned 
with it. 

Distortion probability for non-head words: d_{>1}(j − j' | B(s_j)). If the word 
has a fertility greater than one, j represents the position of a non-head word 
and j' is the position of the first word to the left of s_j which is aligned with 
e_i. 

NULL Translation Model p_1: This allows for the calculation of the probability 
of the number of words of s aligned to the NULL value in the target sentence. 

Finally, P(a, s|e) is calculated by multiplying all the sub-model probabilities 
described above. See Brown et al. [11] and Germann et al. [8] for a detailed 
discussion of this translation model and a description of its parameters. 

2.2 Decoder Problem 

In this paper we focus our attention on the decoding problem. Knight [7] has 
shown that, for an arbitrary word reordering, the decoding problem is 
NP-complete. Roughly speaking, a decoder takes as input a new sentence s and 
tries to generate as output a translated sentence e that maximizes the 
translation probability P(e|s), where 

    P(e|s) = P(e) P(s|e)        (1) 

and 

    P(s|e) = Σ_a P(a, s|e)      (2) 

is the sum of P(a, s|e) over all possible alignments a. In practice it is 
infeasible to consider all alignments. Thus, the approximation 
P(s|e) ≈ P(a, s|e) is used to find the most probable pair ⟨e, a⟩, that is, the 
one that maximizes P(e) P(a, s|e). 

3 Evolutionary Decoding Algorithm 

Evolutionary algorithms (EAs) [9] start by initializing a set of possible solutions 
called individuals. An EA is an iterative process that tries to improve the 
average fitness of a set of individuals by applying a transformation procedure 
to a set of selected individuals to construct a new population. After some 
criterion is met, the algorithm returns the best individuals of the population. 
In this section we present the components of an evolutionary decoding algorithm 
(EDA), specifically designed to solve the decoding problem. The goal of EDA is 
to find both good alignments and good translated sentences. 

Problem s = Mary no dio una bofetada a la bruja verde 

Chromosome = [1 3 4 4 4 0 5 7 6] (NULL, Mary, did, not, slap, the, green, witch) 
             Alignment structure a   Translation structure e 

Fig. 4. Example of Chromosome 

3.1 Individuals Representation 

A chromosome is composed of two related structures: an alignment structure 
and a translation structure. The alignment structure is as defined in section 2, 
that is, a fixed-length vector of integers a_j. The translation structure is a 
variable-length string of tokens that represents a translated sentence. Figure 
4 shows the representation of the example of figure 2. This representation 
contains all the information that the algorithm requires to do a fast evaluation 
and to use specialized genetic operators. 
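
As an illustration only, the chromosome of figure 4 could be held in a small 
data structure such as the one sketched below. The class name Chromosome, its 
field names and the helper aligned_target are hypothetical and are not taken 
from the EDA implementation; index 0 of the translation is assumed to hold the 
NULL token. 

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Chromosome:
    """Illustrative container for one individual (names are assumptions)."""
    alignment: List[int]    # a_j for each source word s_j; 0 means NULL
    translation: List[str]  # target tokens; index 0 holds the NULL token

    def aligned_target(self, j: int) -> str:
        """Return the target token aligned with the j-th source word (1-based)."""
        return self.translation[self.alignment[j - 1]]


# The example of figure 4.
source = "Mary no dio una bofetada a la bruja verde".split()
chrom = Chromosome(
    alignment=[1, 3, 4, 4, 4, 0, 5, 7, 6],
    translation=["NULL", "Mary", "did", "not", "slap", "the", "green", "witch"],
)

# Pair every source word with the target token it is aligned to.
pairs = [(s, chrom.aligned_target(j)) for j, s in enumerate(source, start=1)]
```

Keeping the alignment as a fixed-length list indexed by source position makes 
it cheap to look up the target word of any source word, which is what the 
operators of the following sections need. 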

3.2 Initial Population 

To generate the initial population we developed a greedy randomized 
construction heuristic. This method attempts to ensure the diversity of the 
initial population. The algorithm is presented in figure 5. For selection, the 
roulette wheel method [9] is used. EDA also uses elitism. 

Procedure Generate Initial-Population 
Begin 
    For each word s_j of the sentence s to be translated 
        Generate its Restricted Candidate List (RCL_j) composed of the cl 
        words belonging to E which have the highest t_i(e_i|s_j)¹ probabilities 
    For l = 1 to popsize 
        For j = 1 to m 
            Chrom[l].e_j = random(RCL_j) 
            Chrom[l].a_j = j 
End 

Fig. 5. Structure of Initial Population Algorithm 

¹ t_i, from the lexical model, represents the probability that a word e_i is a 
translation of a word s_j. 
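
The procedure of figure 5 can be sketched in a few lines of Python. This is a 
minimal sketch under stated assumptions, not the authors' code: the table T_INV 
is a toy stand-in for the trained probabilities t_i(e_i|s_j), the function 
names are hypothetical, and every source word is assumed to have at least one 
candidate translation. 

```python
import random
from typing import Dict, List, Tuple

# Toy stand-in for t_i(e|s): source word -> {target word: probability}.
T_INV: Dict[str, Dict[str, float]] = {
    "la": {"the": 0.7, "it": 0.2, "her": 0.1},
    "bruja": {"witch": 0.9, "hag": 0.1},
}


def restricted_candidate_list(t_inv, s_word: str, cl: int) -> List[str]:
    """Return the cl target words with the highest probability for s_word."""
    candidates = t_inv[s_word]
    ranked = sorted(candidates, key=candidates.get, reverse=True)
    return ranked[:cl]


def initial_population(source: List[str], t_inv, popsize: int, cl: int,
                       rng: random.Random) -> List[Tuple[List[int], List[str]]]:
    """Build popsize individuals as (alignment, translation) pairs."""
    rcl = [restricted_candidate_list(t_inv, s, cl) for s in source]
    population = []
    for _ in range(popsize):
        # One random candidate per source word; position 0 is the NULL token.
        translation = ["NULL"] + [rng.choice(c) for c in rcl]
        alignment = list(range(1, len(source) + 1))  # a_j = j
        population.append((alignment, translation))
    return population


population = initial_population(["la", "bruja"], T_INV, popsize=4, cl=2,
                                rng=random.Random(0))
```

Restricting the random choice to the cl best candidates keeps the individuals 
plausible while still producing a varied population. 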
3.3 Fitness Function 

In order to obtain a clearer discrimination between fitness values, we define 
the following evaluation function: 

    ê = argmin_e { −log(P(e)) − log(P(s|e)) }        (3) 

The log function is monotonic, thus the lowest fitness value corresponds to the 
best translation. The function works with the trained parameters of both 
models. Furthermore, it simplifies the partial evaluations of all models. 
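
The following sketch shows why the negative-log form is convenient: the 
products of equation (3) become sums, so the contribution of each factor can be 
evaluated separately. The function names and the numeric factors are 
illustrative assumptions, not values from the trained models. 

```python
import math
from typing import Iterable


def neg_log(probabilities: Iterable[float]) -> float:
    """Sum of -log(p) over the factors of a model probability."""
    return -sum(math.log(p) for p in probabilities)


def fitness(lm_factors: Iterable[float], tm_factors: Iterable[float]) -> float:
    """-log P(e) - log P(s|e), with each probability given as its factors."""
    return neg_log(lm_factors) + neg_log(tm_factors)


# Two candidates with the same language model factors; the second one has a
# much less likely translation model factor, so its fitness value is higher.
good = fitness([0.2, 0.5], [0.3, 0.4])
bad = fitness([0.2, 0.5], [0.03, 0.4])
```

Because the evaluation is a sum, changing one word of an individual only 
requires recomputing the terms that involve that word. 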

3.4 Specialized Recombination Operators 

We designed three different recombination operators. The goal is to exchange 
translation information between the parents in order to create a better new 
individual. All of these operators create two offspring, but only the better of 
the two continues to the next generation. Figure 6 shows an example of applying 
each operator to the same sentence to be translated. 



Source sentence: “La cantidad de variables es desconocida” 

Parent I      [0 1 2 3 4 5]    (NULL, Quantity, of, variables, is, unknown) 
Parent II     [1 2 0 3 5 4]    (NULL, The, amount, variables, unknown, holds) 

One Point Alignment Crossover 
Offspring I   [1 2 0 | 3 4 5]  (NULL, The, amount, variables, is, unknown) 
Offspring II  [0 1 2 | 3 5 4]  (NULL, Quantity, of, variables, unknown, holds) 

Lexical Exchange Crossover 
Offspring I   [0 1 2 3 4 5]    (NULL, Amount, of, variables, holds, unknown) 
Offspring II  [1 2 0 3 5 4]    (NULL, The, quantity, variables, unknown, is) 

Greedy Lexical Crossover 
Offspring I   [0 1 2 3 4 5]    (NULL, Amount, of, variables, is, unknown) 
Offspring II  [1 2 0 3 5 4]    (NULL, The, amount, variables, unknown, is) 

Fig. 6. Example of Recombination Operators 



One Point Alignment Crossover: This operator performs a crossover on the 
alignment part of the representation; the translation structure is changed 
according to the new alignment obtained in each offspring. It is shown in 
figure 7. 

Lexical Exchange Crossover: This is a fast operator that focuses on exchanging 
the lexical components of the chromosome. Both children inherit an alignment 
structure from their parents. In order to construct each child’s translation 
structure, synonymous words from both parents are interchanged according to 
their alignment. 

Greedy Lexical Crossover: This algorithm tries to find a good alignment. 
Each child has the same alignment structure as one of the parents. In order to 
construct each child’s translation structure, the best translated word for s_j 
from both parents is selected using the lexical model t(s_j|e_i). This word is 
placed in each child’s translation structure at the position determined by its 
alignment. 
Procedure One Point Alignment Crossover (Parent 1, Parent 2) 
Begin 
    Randomly select a position p in the alignment section 
    Cross the alignment sections of the parents from positions 1 to p 
    Construct the translation structure of each offspring according to the new 
    alignment, inheriting the corresponding words from the parents 
End 

Fig. 7. Structure of One Point Alignment Crossover 
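
One possible reading of figure 7 is sketched below. It is an interpretation, 
not the authors' implementation: it assumes that both parents have translation 
structures of the same length and that each offspring word is copied from the 
parent that supplied the corresponding alignment entry. The function name and 
the tuple layout (alignment, translation) are hypothetical. With the parents of 
figure 6 and p = 3 it reproduces the two offspring shown there. 

```python
from typing import List, Tuple

Individual = Tuple[List[int], List[str]]  # (alignment, translation)


def one_point_alignment_crossover(parent1: Individual, parent2: Individual,
                                  cut: int) -> Tuple[Individual, Individual]:
    """Swap the first `cut` alignment entries and rebuild both translations."""

    def build(head: Individual, tail: Individual) -> Individual:
        head_a, head_e = head
        tail_a, tail_e = tail
        alignment = head_a[:cut] + tail_a[cut:]
        words = list(tail_e)  # positions nobody points to keep the tail's word
        for j, pos in enumerate(alignment):
            donor = head_e if j < cut else tail_e
            words[pos] = donor[pos]  # inherit the word the donor aligned to s_j
        return alignment, words

    return build(parent2, parent1), build(parent1, parent2)


parent_1 = ([0, 1, 2, 3, 4, 5],
            ["NULL", "Quantity", "of", "variables", "is", "unknown"])
parent_2 = ([1, 2, 0, 3, 5, 4],
            ["NULL", "The", "amount", "variables", "unknown", "holds"])
offspring_1, offspring_2 = one_point_alignment_crossover(parent_1, parent_2, 3)
```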



3.5 Specialized Asexual Operators 

We propose two exploration operators which help the algorithm to escape from 
local optima: 

Mutation Word: This operator acts on the translation structure by selecting a 
word s_k. The current word e_i that translates s_k is replaced by a randomly 
selected synonym from RCL_k. 

Simple Swap: This operator randomly selects the positions of two words in the 
translation structure and swaps the words. It modifies the alignment structure 
according to the new word positions. 

Because decoding is a complex problem [6], we also include two hybrid operators 
which perform a local search procedure. 

Language Model Local Search: This is a local search operator that works on 
the language model. It is a hill-climbing procedure whose goal is to improve 
the language model probability, which is calculated using trigram partitions. 
The operator analyzes each sequence of three consecutive words of the 
translation structure. It uses a partial evaluation of the six permutations in 
order to select the best ordering of the trigram. Finally, when all trigrams 
have been analyzed and possibly changed, the algorithm makes a global 
evaluation to accept or reject the new individual. 
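
The hill-climbing step just described can be sketched as follows. This is a 
minimal illustration, assuming that the caller supplies two cost functions 
(lower is better): trigram_cost for the partial evaluation of one trigram and 
sentence_cost for the global evaluation. Both names, and the toy costs used in 
the example, are hypothetical; the word list is assumed to exclude the NULL 
token. 

```python
from itertools import permutations
from typing import Callable, List, Sequence


def lm_local_search(words: Sequence[str],
                    trigram_cost: Callable[[Sequence[str]], float],
                    sentence_cost: Callable[[Sequence[str]], float]) -> List[str]:
    """Reorder each window of three words, then accept or reject globally."""
    candidate = list(words)
    for i in range(len(candidate) - 2):
        window = candidate[i:i + 3]
        # Six permutations; the current order comes first, so ties keep it.
        best = min(permutations(window), key=trigram_cost)
        candidate[i:i + 3] = best
    # Global evaluation: keep the new individual only if it is strictly better.
    if sentence_cost(candidate) < sentence_cost(words):
        return candidate
    return list(words)


# Toy cost: count adjacent word pairs that are out of a reference order.
REFERENCE = ["the", "green", "witch", "slept"]
RANK = {w: k for k, w in enumerate(REFERENCE)}


def toy_cost(seq: Sequence[str]) -> int:
    return sum(RANK[a] > RANK[b] for a, b in zip(seq, seq[1:]))


improved = lm_local_search(["green", "the", "witch", "slept"], toy_cost, toy_cost)
```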

Translation Model Local Search: This is a best-improvement operator that 
uses the features of IBM Model 4. It works with the fertility concept. The 
algorithm is shown in figure 8. 

It is an exhaustive procedure that first tries to insert zero-fertility words 
from the vocabulary into the translation structure. In a second step it deletes 
zero-fertility words from the translation structure whenever their removal 
improves the evaluation function. The next step focuses on fertility. The idea 
is to increase the fertility of words in the translation structure in order to 
reduce the number of words in the translation. This increase in fertility is 
accepted only if the evaluation function improves. 

3.6 Parameter Control Mechanisms in EDA 

The algorithm manages the most critical parameters, such as the probabilities 
of the recombination and mutation operators, with an adaptive parameter control 
strategy. The goal is to find a better combination of the parameters by changing 
their values during the execution according to the state of the search [12]. In 
the beginning all 
Procedure Translation Model Local Search (Chromosome) 
Begin 
    For each word e_z with zero fertility in E 
        Try to insert e_z in the position of the translation structure 
        which gives the best improvement of the evaluation of the language model 
    For each word e_i with zero fertility in the translation structure 
        Delete e_i when this action improves the evaluation function 
    For i_1 = 1 to l 
        For i_2 = 1 to l 
            If i_1 ≠ i_2 
                If φ_{i_1} > 1 and φ_{i_2} > 0 
                    link all s_j words aligned with e_{i_2} to e_{i_1} 
                    delete e_{i_2} from the translation structure 
End 

Fig. 8. Structure of Translation Model Local Search 



the operator probabilities are equal. The algorithm computes a ranking based 
on the statistical information accumulated during g generations. It classifies 
the operators according to their success at finding good offspring. It rewards 
the operators that produce better offspring by increasing their probabilities; 
consequently, the probabilities of the worse operators are reduced. This is 
expressed by the following equation: 

    P_{i,t+1} = (1 − α) · R_{i,t} + α · P_{i,t}        (4) 

where P_{i,t} is the probability of operator i in generation t, R_{i,t} is its 
reward and α is a momentum parameter used to smooth the changes in the 
probabilities. 
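
Equation (4) can be illustrated with a short sketch. The mapping from ranking 
to reward (best-ranked operator receives the largest reward) and the use of a 
success count as the ranking criterion are assumptions made for the example; 
the operator names and counts are invented. The values α = 0.4 and the rewards 
0.5, 0.35, 0.15 are those reported in section 4. 

```python
from typing import Dict, List


def update_operator_probabilities(probs: Dict[str, float],
                                  successes: Dict[str, int],
                                  rewards: List[float],
                                  alpha: float) -> Dict[str, float]:
    """Apply P_{i,t+1} = (1 - alpha) * R_{i,t} + alpha * P_{i,t} to each operator."""
    # Rank operators by accumulated success; the best one gets rewards[0].
    ranked = sorted(probs, key=lambda op: successes[op], reverse=True)
    reward_of = {op: rewards[rank] for rank, op in enumerate(ranked)}
    return {op: (1 - alpha) * reward_of[op] + alpha * probs[op] for op in probs}


# All operators start with equal probability.
probs = {"one_point": 1 / 3, "lexical_exchange": 1 / 3, "greedy_lexical": 1 / 3}
# Hypothetical success counts accumulated during g generations.
successes = {"one_point": 2, "lexical_exchange": 7, "greedy_lexical": 4}

new_probs = update_operator_probabilities(probs, successes,
                                          rewards=[0.5, 0.35, 0.15], alpha=0.4)
```

Since the rewards sum to one, the updated values remain a probability 
distribution whenever the previous ones were. 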



4 Experimental Results 

We used articles from the bilingual ACM Crossroads magazine² to construct the 
corpus shown in table 1. We obtained 4812 bilingual English-Spanish sentences 
in the context of computer science, of which 4732 sentences are used for 
training and 80 sentences for testing. 



Table 1. Training and test conditions for the computational corpus. 

                        Spanish    English 
Vocabulary                 8681       6721 
Training:  Sentences           4732 
           Words          92933      84650 
Test:      Sentences             80 
           Words           1125       1051 

² http://www.acm.org/crossroads 
We used the CMU-Cambridge Statistical Language Modeling Toolkit v2³ to train 
the language model. The GIZA++⁴ toolkit was used to estimate the parameters of 
the translation model. 

The algorithm works with the following tuned parameters: population size 50, 
maximum number of generations 50 and cl = 15. The most critical parameters 
use the dynamic adaptive parameter control mechanism described in section 3.6, 
with α = 0.4, g = 5 and rewards matrix 0.5, 0.35, 0.15. The hardware platform 
was a PC Pentium III, 870 MHz, with 256 MB RAM under Linux 9.0. 

Performance Measures: There are two widely used metrics [13] to measure 
quality in MT: Word Error Rate (WER) and Position-independent Error Rate 
(PER). WER corresponds to the number of transformations (insert, replace, 
delete) that must be applied to the translation generated by EDA in order to 
obtain the reference translation. PER is similar to WER, but it only takes into 
account the number of replace actions required. 
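
For concreteness, the WER computation can be sketched as a word-level edit 
distance. Expressing the result as a percentage of the reference length is a 
common convention and is an assumption here, since the normalization is not 
stated above; the function names and the example sentences are illustrative. 

```python
from typing import Sequence


def word_edit_distance(hyp: Sequence[str], ref: Sequence[str]) -> int:
    """Minimum number of insert, replace and delete operations on words."""
    prev = list(range(len(ref) + 1))  # distances from the empty hypothesis
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            curr.append(min(prev[j] + 1,              # delete h
                            curr[j - 1] + 1,          # insert r
                            prev[j - 1] + (h != r)))  # replace or match
        prev = curr
    return prev[-1]


def wer(hyp: Sequence[str], ref: Sequence[str]) -> float:
    """Edit distance as a percentage of the reference length (assumed norm)."""
    return 100.0 * word_edit_distance(hyp, ref) / len(ref)


reference = "Mary did not slap the green witch".split()
hypothesis = "Mary did not slap green witch".split()
score = wer(hypothesis, reference)  # one missing word out of seven
```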

4.1 Tests 

The algorithm is evaluated using two classes of tests. The first one is a 
comparison between EDA and the greedy decoder ISI ReWrite Decoder v.0.7b⁵, 
which is based on the work of Germann et al. [8]. Both decoders used the same 
language model and the same translation model, and they were trained with the 
same corpus. The second one compares the translations obtained with our 
evolutionary decoder against those of general-domain public translators. Table 
2 shows that EDA outperforms the greedy decoder. We observed that EDA obtains 
better sentence alignments than the greedy decoder and that the translations it 
generates are closer to the reference sentences. The results of EDA are 
more remarkable in WER than in PER, because PER takes into account only the 
words and not the full sentence alignment. The specialized evolutionary 
operators enable the algorithm to perform a search focused on both the words 
and their alignment. Finally, we compare our results with two public domain 
translators: Babelfish⁶ and SDL International⁷. EDA outperforms their 
translation quality because it was specifically trained and designed for the 
computer science context. 

5 Conclusions 

Using an evolutionary approach to translate within the SMT framework is feasible 
and comparable in quality with other techniques using the same framework and 
with other kinds of translators. General translators are powerful across a wide 
range of application areas; in contrast, EDA shows better performance because 
of its specific context. 
There are a variety of statistical translation models, all of which need a decoder for 

³ http://mi.eng.cam.ac.uk/~prc14/toolkit.html 
⁴ http://www-i6.informatik.rwth-aachen.de/web/Software/GIZA++.html 
⁵ http://www.isi.edu/natural-language/software/decoder/manual.html 
⁶ http://babelfish.altavista.com/ 
⁷ http://ets.freetranslation.com/ 
Table 2. Experimental results of translations according to the length of sentences 

                             WER % by sentence length               PER % by sentence length 
Type             Algorithm     8     10    12    16    18   Avg.     8     10    12    16    18   Avg. 
Statistical      EDA         18.7  27.6  40.2  36.5  34.8  31.5    17.2  24.8  35.4  32.0  29.1  27.7 
Statistical      Greedy      29.7  41.3  51.4  44.6  47.7  42.9    20.2  31.2  37.9  33.5  30.5  30.7 
General Purpose  BabelFish   43.2  57.2  58.9  56.2  47.0  52.5    32.5  41.0  47.7  41.6  34.5  39.5 
General Purpose  SDL Int.    40.1  60.2  61.8  67.7  50.5  56.1    29.1  45.3  48.5  52.7  43.6  43.8 

translation. The results suggest that our technique is a good option for 
implementing a decoder by adapting EDA to the features of a specific 
statistical model. 
Abstract. The aim of this paper is to present a model for the interpretation of 
imperative sentences in which reasoning agents play the role of speakers and 
hearers. A requirement is associated with both the person who makes and the 
person who receives the order which prevents the hearer coming to 
inappropriate conclusions about the actions s/he has been commanded to do. By 
relating imperatives with the actions they prescribe, the dynamic aspect of 
imperatives is captured and by using the idea of encapsulation, it is possible to 
distinguish what is demanded from what is not. These two ingredients provide 
agents with the tools to avoid inferential problems in interpretation. 



1 Introduction 

There has been an increasing tendency to formalize theories which describe different 
aspects of computational agents trying to emulate some features of human agents, 
such as reasoning, that are required to perform an autonomous interpretation of 
language and to follow a course of action. Another example is the formalization of 
relations of power among agents, where an agent makes other agents satisfy his/her 
goals (e.g. [10; 11]). Here we present a model for the interpretation of imperatives 
in which agents represent speakers and hearers. Once an agent has uttered an order, 
the main role of the agent addressed is to interpret it and decide what course of 
action s/he needs to follow so that the order given can be satisfied. Nevertheless, 
such autonomous reasoning behaviour might lead to wrong conclusions, derived from a 
weak formalization. In the specific case of the interpretation of imperatives, there 
is an additional problem: imperatives do not denote truth values. The term practical 
inference has been used to refer to inferential patterns involving imperatives. For 
instance, if an agent A is addressed with the order Love your neighbours as yourself! 
and A realizes that Alison is one of those objects referred to as his/her neighbours, 
then A could infer Love Alison as yourself!, even though the order given cannot be 
true or false [9; 14; 19]. 

Formalizations in which imperatives are translated into statements of classical 
logic are problematic, as they can lead an agent to draw inappropriate conclusions. 
In those approaches, if an agent A is given the order Post the letter!, s/he can 
erroneously infer that s/he has been ordered to Post the letter or burn the letter! 
by using the rule of introduction for disjunction. 

Thus, having a choice, agent A might decide to burn the letter. In deontic 
approaches this is known as the Paradox of Free Choice Permission, which was 
thought to be an unsolved problem as recently as 1999 [18]. 

Here we present a model which does not suffer from this kind of paradoxical 
behavior. It involves the following ingredients: a) agents with the ability to 
interpret imperative sentences within b) a context. It also captures c) the dynamic 
aspect of imperatives, so that imperatives are not translated into truth-denoting 
statements. Finally, d) encapsulation makes agents capable of distinguishing what is 
uttered from what is not, so avoiding ‘putting words in the mouth of the speaker’. 

The rest of the paper is organized as follows. First as a preamble to the model, the 
concepts of imperative, context and requirement are defined. Then a formalization is 
presented followed by examples illustrating that the model overcomes inferential 
problems in the interpretation of imperatives. The paper ends with some conclusions. 



2 Analysis 

In this section, we describe some of the main concepts which need to be addressed by 
the model in which agents interpret imperatives. As a first step we define imperative 
sentences as they are considered in this paper. 

Definition: Imperative 

Imperatives are sentences used to ask someone to do or not to do something 
and that do not denote truth- values. 

This definition introduces a distinction between different sentences used to ask 
someone to do something. Following the definition, Come here! might convey the 
same request as I would like you to come here. However, the former does not denote a 
truth value, whereas the latter does. The former provides an example of the kind of 
sentences that we shall address here. It is worth mentioning that the ‘something’ 
which is requested in an imperative shall be called a requirement. Other examples of 
imperatives are: a) direct: Come here!; b) negative: Don’t do that!; c) conjunctive: 
Sit down and listen carefully!; d) disjunctive: Shut up or get out of here!; 
e) conditional: If it is raining, close the window! 



2.1 Context 

It is widely accepted that the interpretation of utterances is context dependent. For 
instance, the imperative Eat!, said by a mother to her son, might be an order. However, 
said to a guest, it might be only an invitation to start eating. The real meaning 
depends on context. 

Many authors agree that context is related to people’s view or perception of the 
world or a particular situation rather than the world or the situation themselves [2; 
16]. That is, context is conceived in terms of what agents have in their minds. After 
all, this is what an agent uses to interpret a sentence. This might include intentions, 
beliefs, knowledge etc. However, we will subscribe to the following definition. 

Definition: Context 

A context is a consistent collection of propositions that reflects a relevant 
subset of agents’ beliefs. 

This view will not commit us here to an ontology or classification of components 
or to the use of operators such as B for beliefs and K for knowledge (Turner [17]). 
We simply assume that all that which constitutes a context can be represented in 
terms of propositions so the context is viewed as a consistent set of propositions [3]. 



2.2 Dynamic Aspect of Imperatives 

Different authors have related imperatives and actions (Ross [14], von Wright [18], 
Hamblin [6] p. 45 and Segerberg [15] among others). Sometimes it is said that 
imperatives prescribe actions. Nevertheless, it would be more precise to say that 
imperatives possess a dynamic aspect. For instance, I would like you to open the door 
and Open the door! might convey the same request. However, the former is a 
statement which denotes a truth value. It can be true or false within a state of 
affairs, but there is no dynamic aspect in it. The latter, in contrast, does not 
denote a truth value, but if we assume that it is uttered in a state of affairs in 
which the door is closed, it demands another, future and wished-for state of affairs 
in which the door is open. That is, it demands a change of states; it involves a 
dynamic aspect (Fig. 1). This suggests that translating imperatives into statements 
is the wrong approach; it does not model a basic aspect of imperatives. 



[Figure: the imperative Open the door! leads from S_i, the initial state 
(pre-conditions: door closed), to S_f, the final state (post-conditions: door open).] 

Fig. 1. Dynamic aspect of imperatives 



2.3 Evaluation of Imperatives and Correctness 

When an agent interprets an imperative, s/he also evaluates it. For instance, in the 
example above, Open the door! would not make sense in a state of affairs where the 
door is already open. It seems that imperatives impose some pre-conditions that the 
agent verifies during the process of interpretation; the door must be closed. 
Complying with an imperative will produce a result, a post-condition which shall 
indicate that the order has been satisfied; the door will be open. Thus, the dynamic 
aspect of imperatives provides us with at least three components, namely 
pre-conditions, imperative, and post-conditions. This resembles what is known as 
Hoare’s triple [8]. In 1969 Hoare proposed a logic to verify correctness of programs. He 





The Role of Imperatives in Inference, Agents, and Actions 






proposed to evaluate triples P{S}Q, where S is a program, P are its pre-conditions, 
and Q are its post-conditions. According to Hoare, the program S is correct iff, 
whenever the assertion P is true before initiation of S, the assertion Q is true on 
its completion. Since the interpretation of imperatives can be construed as involving 
a verification process, here we adopt the concept of correctness of an imperative, 
which is defined analogously by using Hoare’s triple P{Imp}Q. 

Definition: Correctness of an Imperative 

The imperative Imp is correct with respect to a state of affairs S_i iff P holds 
in S_i and Q holds w.r.t. the state reached after the imperative is satisfied. 

An imperative is satisfied when the agent addressed complies with the imperative, 
reaching the state wished by the speaker. 



2.4 Encapsulation 

If a person is given the order Close all the windows! while being in a house, and s/he 
realizes that the kitchen’s window is open, then the agent might conclude that s/he 
should close that window, as a derivation of the order given. However, the person 
will not assume that his/her inferential derivation, Close the kitchen’s window, is 
an imperative uttered by the speaker. An agent should also distinguish 
between uttered and derived requirements. 

Now we will present the model, illustrating how it is able to describe the main 
features of imperatives and how it overcomes the paradoxical behavior faced by other 
approaches. 



3 Model 

L_ImpA is a dynamic language, defined along the lines of first-order dynamic logic as 
in Harel [7]. In this language Hoare’s triples can be represented and, therefore, so 
can the concept of requirement. The ability of an agent (the actions that an agent is 
able to perform) can also be represented, and its semantics allows us to verify the 
validity of these concepts with respect to a context. 



3.1 Definition of Sets 

We define the following sets. C = {c, c_1, c_2, ...} is a set of constant symbols. 
Analogously we define sets for variable symbols (V); function symbols (F); regular 
constant symbols (C); speaker constant symbols (CS); speaker variable symbols (S); 
hearer constant symbols (CH); hearer variable symbols (H); atomic actions (AtAct); 
and atomic predicate symbols (AtPred). We assume that AC = C ∪ CS ∪ CH and 
AV = V ∪ S ∪ H. 
3.2 Definition of Terms 

Terms are defined recursively by: t ::= c | cs | ch | v | s | h | f(t_1, t_2, ..., t_n). 
Thus, a term is a regular constant (c), a speaker constant (cs), a hearer constant 
(ch), a regular variable (v), a speaker variable (s), a hearer variable (h) or a 
function f(t_1, ..., t_n) of arity n (n arguments), where t_1, ..., t_n are terms. The 
expressions ts ::= cs | s and th ::= ch | h define the terms for speakers and hearers 
respectively as constants or variables. 



3.3 Definition of wff of the Language 

The set FOR contains all possible wffs in L_ImpA and the set Act contains all 
possible actions defined in the category of actions. The definition of the language 
L_ImpA is given by φ ::= p(t_1, ..., t_n) | t_1=t_2 | ¬φ | φ_1 ∧ φ_2 | ∃xφ | [α]φ. 
In other words, if p ∈ AtPred, t_1, t_2, ..., t_n are terms, x ∈ V, and α ∈ Act, 
then p(t_1, t_2, ..., t_n) is an atomic predicate with arity n; t_1=t_2 is the 
equality test (=); ¬φ is the negation of φ; φ ∧ ψ is the conjunction of φ and ψ; 
∃xφ is the existential quantifier; and [α]φ is a modal expression indicating that φ 
holds after the action α is performed. The usual abbreviations are assumed: 
φ_1 ∨ φ_2 ≡ ¬(¬φ_1 ∧ ¬φ_2), φ_1 → φ_2 ≡ ¬φ_1 ∨ φ_2, 
φ_1 ↔ φ_2 ≡ (φ_1 → φ_2) ∧ (φ_2 → φ_1), ∀xφ ≡ ¬∃x¬φ and ⟨α⟩φ ≡ ¬[α]¬φ. 



3.4 Category of Actions 

The set Act of actions is defined as follows: 
α ::= a(t_1, t_2, ..., t_n) | φ? | α_1;α_2 | α_1+α_2 | (α)_{ts,th} | (α)_{th}. 
In other words, if α, α_1, α_2 ∈ Act, t_1, t_2, ..., t_n are terms and ts, th are 
terms for speaker and hearer respectively, then a(t_1, t_2, ..., t_n) is an atomic 
action; α_1;α_2 is the sequential composition of actions; α_1+α_2 is the disjunction 
of actions; φ? is a test, which just verifies whether φ holds or not; (α)_{ts,th} is 
a requirement, an action requested directly, or derived from a requested one, by a 
speaker ts of a hearer th; and (α)_{th} is an action that a hearer th is able to do. 
In this way we keep track of the agents involved in a requirement, either uttered or 
derived. 



3.5 Representation of Requirements 

Requirements are represented in terms of actions with explicit reference to the 
speaker who demands, and the hearer addressed. 

Because of the dynamic aspect of imperatives, they are associated with the actions 
they prescribe and therefore the dynamic operators are used among them. Thus, the 
sequencing operator (;) models a conjunction of requirements, the choice operator (+) 
models a disjunction of requirements, and a conditional requirement is represented by 
using the symbol ⇒, where (φ⇒α) = (φ?;α). Following Harel [7] and Gries [5], a 
Hoare triple P{α}Q can be represented in L_ImpA as P→[α]Q. Thus, P→[(a)_{ts,th}]Q 
is an atomic requirement, P→[(α_1;α_2)_{ts,th}]Q is a conjunction of requirements, 
P→[(α_1+α_2)_{ts,th}]Q is a disjunction of requirements, and P→[(φ?;α)_{ts,th}]Q 
is a conditional requirement. 
3.6 Axioms 

A0) T (any tautology); A1) [φ?;α]ψ ↔ φ→[α]ψ; 
A2) [(φ?;α)_{ts,th}]ψ ↔ φ→[(α)_{ts,th}]ψ; A3) [(φ?;α)_{th}]ψ ↔ φ→[(α)_{th}]ψ; 
A4) [α_1;α_2]φ ↔ [α_1]([α_2]φ); A5) [α_1+α_2]φ ↔ [α_1]φ∧[α_2]φ; 
A6) [φ?]ψ ↔ φ→ψ; A7) [α](φ→ψ) → ([α]φ→[α]ψ); 
A8) ∀xφ(x) → φ(t) provided that t is free for x in φ(x); 
A9) ∀x(φ→ψ) → (φ→∀xψ) provided that x is not free in φ. 
Furthermore we can relate requirements and ability of hearers: 
shA1) [(α)_{ts,th}]ψ → [(α)_{th}]ψ; hA2) [(α)_{th}]ψ → [α]ψ. 

Axioms A0)-A7) are standard in Dynamic Logic, where A2) and A3) explicitly 
include speakers and hearers, and A8)-A9) are standard in predicate logic. 
shA1) is analogous to the axiom of Chellas (1971: p. 125), where ‘ought’ implies 
‘can’ in his model of imperatives through Obligation and Permission. Here shA1) 
expresses that if α, demanded by ts of th, is correct, then there is some action, 
usually a sequence α = a_1;a_2;...;a_n of actions, that the hearer is able to 
perform, so that α can be satisfied. hA2) emphasizes that any action a hearer is 
able to do is simply an action in nature. 



3.7 Inference Rules 

a) Modus Ponens (MP): If φ and φ→ψ then ψ; b) Necessitation rule (Nec): If φ then 
[α]φ; c) Universal generalization (UG): If φ then ∀xφ provided x is not free in φ. 



3.8 Interpretation 

The interpretation for L_ImpA and its soundness follow Harel’s (1979) semantics for 
first-order dynamic logic. A model (m) for L_ImpA is presented elsewhere [12]. Due to 
lack of space, we do not repeat the details here. The model uses a possible worlds 
semantics in which actions define sets of pairs of states (w, w') such that 
performing an action starting in state w reaches state w'. If a state w satisfies a 
formula φ we use the notation w ⊨ φ. 



3.9 Truth with Respect to a Context 

Let k = {φ_1, φ_2, ..., φ_n} represent our context, where for i = 1, ..., n, 
φ_i ∈ FOR. We may also identify the set of states defined by our context as follows: 
W(k) = {w | for every φ ∈ k, w ⊨ φ}. We use the notation k ⊨_m φ to indicate that φ 
is true in the model m with respect to context k, for any assignment τ. We abbreviate 
k ⊨_m φ simply as k ⊨ φ. When φ is not true at k under m we can write k ⊭ φ. Thus, if 
φ ∈ FOR, k ⊨ φ iff for every w, if w ∈ W(k) then w ⊨ φ. In this model we are not 
providing a detailed treatment of beliefs; that is why we assume that, in expressions 
involving more than one agent, context simply represents a common set of beliefs 
shared by the agents involved. 
3.10 Hoare Style Rules 

The following are derived rules, which operate between actions and requirements. Pre 
usually indicates pre-conditions and Pos post-conditions. The equality 
Pre{α}Pos = Pre→[α]Pos restricts both sides to hold in the same context. 

(I;) Introduction for composition: 

If ⊢ Pre→[α_1]Pos' and ⊢ Pos'→[α_2]Pos then ⊢ Pre→[α_1;α_2]Pos. 

(I+) Introduction for disjunction: 

If ⊢ Pre→[α_1]Pos and ⊢ Pre→[α_2]Pos then ⊢ Pre→[α_1+α_2]Pos. 

(I⇒) Introduction for conditional: 

If ⊢ (Pre∧φ)→[α]Pos and ⊢ (Pre∧¬φ)→Pos then ⊢ Pre→[φ?;α]Pos. 
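
The behaviour of these operators can be explored on a small finite model. The 
sketch below is only an illustration under simplifying assumptions and is not the 
model m referred to in section 3.8: states are sets of true atoms, an action maps a 
state to its set of successor states, speakers and hearers are left out, and all 
function names are hypothetical. It checks formulas of the form Pre→[α]Pos in every 
state that satisfies a context, and shows that a triple which holds for Post the 
letter! fails for the choice Post the letter or burn the letter!, as required by 
the standard semantics of + (both branches must reach Pos). 

```python
from itertools import combinations
from typing import Callable, FrozenSet, Iterable, List, Set

State = FrozenSet[str]                   # a state is the set of atoms true in it
Action = Callable[[State], Set[State]]   # an action maps a state to its successors
Formula = Callable[[State], bool]


def all_states(atoms: Iterable[str]) -> List[State]:
    """Every subset of the atoms, i.e. every possible state."""
    atoms = list(atoms)
    return [frozenset(c) for r in range(len(atoms) + 1)
            for c in combinations(atoms, r)]


def atomic(add: Iterable[str]) -> Action:
    """Atomic action that makes the atoms in `add` true."""
    added = frozenset(add)
    return lambda w: {w | added}


def seq(a1: Action, a2: Action) -> Action:      # sequential composition a1;a2
    return lambda w: {w2 for w1 in a1(w) for w2 in a2(w1)}


def choice(a1: Action, a2: Action) -> Action:   # disjunction a1+a2
    return lambda w: a1(w) | a2(w)


def guard(phi: Formula) -> Action:              # test phi?
    return lambda w: {w} if phi(w) else set()


def box(a: Action, phi: Formula) -> Formula:    # [a]phi
    return lambda w: all(phi(w2) for w2 in a(w))


def triple(pre: Formula, a: Action, post: Formula) -> Formula:  # pre -> [a]post
    return lambda w: (not pre(w)) or box(a, post)(w)


def holds(states: Iterable[State], context: Iterable[Formula],
          formula: Formula) -> bool:
    """True iff the formula holds in every state that satisfies the context."""
    context = list(context)
    return all(formula(w) for w in states if all(c(w) for c in context))


STATES = all_states(["posted", "burnt"])
post = atomic(["posted"])
burn = atomic(["burnt"])
untouched: Formula = lambda w: not w            # neither posted nor burnt
posted: Formula = lambda w: "posted" in w
both: Formula = lambda w: "posted" in w and "burnt" in w

ok_post = holds(STATES, [], triple(untouched, post, posted))
ok_choice = holds(STATES, [], triple(untouched, choice(post, burn), posted))
ok_seq = holds(STATES, [], triple(untouched, seq(post, burn), both))
ok_guard = holds(STATES, [], triple(untouched, seq(guard(posted), burn), both))
```

Here ok_guard is true only vacuously: the test posted? fails in the untouched 
state, so the guarded action has no successor states. 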



3.11 Correctness for Imperatives 

Having all this infrastructure to represent and verify requirements, we can formalize 
the definition of correctness for imperatives. 

Definition: Correctness of an Imperative 

Given a requirement (α)_{s,h}, prescribed by an imperative utterance 
Imp(k, P, (α)_{s,h}, Q), we say that Imp is correct w.r.t. context k iff 
k ⊨ P→[(α)_{s,h}]Q for appropriate pre- and post-conditions P and Q. 

Note that this definition of correctness is only a case of the more general 
definition k ⊨ P→[α]Q, which defines the correctness of any action in L_ImpA. This 
includes requirements, ability of agents and actions in general. 



3.12 Encapsulating Uttered Requirements 

In order for an agent to distinguish what is uttered from what is not, we encapsulate as 
follows. 

Definition: Set of Requirements 

Let $\sigma_k = \langle (\alpha_1)_{s,h}, (\alpha_2)_{s,h}, \ldots, (\alpha_n)_{s,h} \rangle$ be a set of requirements demanded in 
context $k$, such that $\alpha_1, \alpha_2, \ldots, \alpha_n$ represent actions prescribed by imperative 
sentences, and $s$ and $h$ represent the agents playing the roles of speaker and 
hearer respectively. 

Note that $\sigma_k$ allows the distinction between demanded and derived actions. On the 
other hand, there is the implicit assumption that all requirements in $\sigma_k$ are supposed to 
be satisfied as long as $\sigma_k$ is correct. 

Definition: Correctness of a Set of Requirements 

A set $\sigma_k$ is correct with respect to context $k$ iff 
$k \models P\{(\alpha_1)_{s,h}; (\alpha_2)_{s,h}; \ldots; (\alpha_n)_{s,h}\}Q$ for appropriate pre and post-conditions $P$ and $Q$. 
4 Model at Work 



In the examples below we assume that $k$, the context, represents not the set of beliefs of 
a particular agent, but rather the set of beliefs shared by the hearer and 
speaker involved. 



a) Uttered and Derived Requirements 

Let us assume that Helen says to Betty, Love your neighbour as yourself! Betty 
should be able to encapsulate the requirement such that 
$\sigma_k = \langle (\text{Love your neighbour as yourself!})_{Helen,Betty} \rangle$. We can paraphrase the order as a 
conditional requirement, where $\alpha(x)$ = Love x as yourself, $\varphi(x)$ = x is your neighbour, 
$Q(x)$ = You love x as yourself and $P(x) = \neg Q(x)$. Thus, the Hoare's triple of the imperative is 
$\forall x P(x) \rightarrow [(\varphi(x) \Rightarrow \alpha(x))_{Helen,Betty}]Q(x)$. If we assume that the requirement is correct w.r.t. $k$, then 
$k \models \forall x P(x) \rightarrow [(\varphi(x) \Rightarrow \alpha(x))_{Helen,Betty}]Q(x)$. This means that for both Helen and Betty, the 
requirement is acceptable according to their beliefs. If furthermore it is the case that 
$\varphi(Alison)$ = Alison is your neighbour, then we can derive as follows. 

1) $k \models \forall x P(x) \rightarrow [(\varphi(x) \Rightarrow \alpha(x))_{Helen,Betty}]Q(x)$    (assumption) 

2) $k \models \forall x (P(x) \wedge \varphi(x)) \rightarrow [(\alpha(x))_{Helen,Betty}]Q(x)$    (1), axiom A1) 

3) $k \models \varphi(Alison)$    (assumption) 

4) $k \models (P(Alison) \wedge \varphi(Alison)) \rightarrow [(\alpha(Alison))_{Helen,Betty}]Q(Alison)$    (2), Univ. Inst.) 

5) $k \models P(Alison) \wedge \varphi(Alison)$    (3), 4), Int. Conj.) 

6) $k \models [(\alpha(Alison))_{Helen,Betty}]Q(Alison)$    (4), 5), MP) 

In 6) Betty would derive the requirement of loving Alison from the original 
request by Helen, given that Alison is one of her neighbours. However, that is not an 
uttered requirement: $(\alpha(Alison))_{Helen,Betty} \notin \sigma_k$. 



b) No Choice 

Let us assume that now Helen says to Betty, Talk to the president! Betty would 
distinguish this uttered requirement as follows: $\sigma_k = \langle (\text{Talk to the president})_{Helen,Betty} \rangle$. 
We can paraphrase the order, such that $\alpha$ = Talk to the president, $Q$ = You have talked 
to the president and $P = \neg Q$. Thus, the Hoare's triple of the imperative is 
$P \rightarrow [(\alpha)_{Helen,Betty}]Q$. If we assume that the requirement is correct w.r.t. $k$, then 
$k \models P \rightarrow [(\alpha)_{Helen,Betty}]Q$. This means that for both Helen and Betty, the requirement of talking 
to the president is acceptable according to their beliefs. 

If we assume that $\beta$ = Kill the president, Betty and Helen cannot introduce a 
disjunction such that Betty believes that a choice has been uttered and given to her. 
That is, from $\sigma_k = \langle (\alpha)_{Helen,Betty} \rangle$ she cannot obtain a set containing the choice 
$(\alpha + \beta)_{Helen,Betty}$. On the other hand, even a verification of a 
choice might be incorrect, that is $k \not\models P \rightarrow [(\alpha)_{Helen,Betty} + (\beta)_{Helen,Betty}]Q$. There might be a 
clash between this verification and Betty's beliefs. 



c) Impossible Requirements 

Let us assume that now Helen says to Betty, Have three arms! Betty would 
distinguish this uttered requirement as follows: $\sigma_k = \langle (\text{Have three arms})_{Helen,Betty} \rangle$. We 
can paraphrase the order, such that $\alpha$ = Have three arms, $Q$ = You have three arms and 
$P = \neg Q$. Thus, the Hoare's triple of the imperative is $P \rightarrow [(\alpha)_{Helen,Betty}]Q$. In this case, 
and under normal circumstances, there would be a clash between this verification and 
Betty's beliefs. Betty's clash can be represented by the following 
expression, $k \models P \rightarrow [(\alpha)_{Helen,Betty}]\neg Q$, which means that there is not a state she can reach 
by doing something so that she can have three arms. In terms of ability we can 
express this as $\neg\langle(\alpha)_{Betty}\rangle Q$, which means that Betty does not believe that she is 
able to perform the action of having three arms. 
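
The difference between a satisfiable and an impossible requirement can be illustrated on a toy model. This is a sketch under stated assumptions, not part of the original model: actions are sets of (start, end) state pairs, the state names are invented, and trying to "have three arms" is assumed to leave the state unchanged.

```python
def able(state, action, goal):
    """<action>goal : some way of doing the action from `state` reaches a goal state."""
    return any(goal(v) for (w, v) in action if w == state)


def always(state, action, goal):
    """[action]goal : every way of doing the action from `state` reaches a goal state."""
    return all(goal(v) for (w, v) in action if w == state)


def has_three_arms(w):
    return w == "three_arms"


def has_talked(w):
    return w == "talked"


# Hypothetical belief model for Betty: talking to the president is feasible,
# whereas attempting to have three arms changes nothing.
talk_to_president = frozenset({("start", "talked")})
have_three_arms = frozenset({("start", "start")})

assert able("start", talk_to_president, has_talked)
assert not able("start", have_three_arms, has_three_arms)           # not <a>Q
assert always("start", have_three_arms, lambda w: not has_three_arms(w))  # [a] not Q
```

The last two assertions mirror the two ways of stating the clash in the text: no reachable state satisfies the post-condition, and the agent lacks the corresponding ability.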



5 Conclusions and Future Work 

We have presented a model in which agents that possess a reasoning ability are able 
to interpret imperative sentences. This does not suffer from the inferential problems 
faced by other approaches to the interpretation of imperatives. 

It is assumed that by various means (order, advice, request, etc.) imperatives 
convey requirements. The dynamic aspect of imperatives allows us to envisage that 
the connectives between imperatives behave similarly but not identically to classical 
logic connectives. A set of dynamic operators is used instead (disjunction (+), 
composition (;), conditional imperative ($\Rightarrow$)). An introduction rule is provided for 
each of these operators. 

The features of the model presented here are that it captures the main aspects of 
imperatives (including the lack of truth-values), and that it corresponds to our 
intuitions about the behavior of imperative sentences. 

The model presented here is useful for verifying imperatives or sequences of 
imperatives, but it is not able to infer new utterances. This distinction between derived 
and uttered requirements allows us to avoid certain paradoxes. 

Propositions and imperatives interact within the model. It allows us to verify the 
appropriate use of imperatives (correctness). Verification of correctness provides a 
legitimation procedure for imperatives, and it is able to detect impossible 
requirements. 

There are many possible extensions for this model, for instance the explicit 
inclusion of time. The introduction of “contrary to duty” imperatives (Prakken and 
Sergot [13]; Alarcon-Cabrera [1]), would be another example. 

In the near future we want to implement this model in a computer system so that it 
can be used in natural language interfaces. At the moment we are working on the 
syntactic analysis of imperatives. 
Abstract. In this paper, a semiautomatic extension of our monolingual 
(Spanish) TERSEO system to a multilingual level is presented. TERSEO 
implements a method of event ordering based on temporal expression 
recognition and resolution. TERSEO consists of two different modules. 
The first module is based on a set of rules that allows the recognition of 
the temporal expressions in Spanish. The second module is based on a set 
of rules that allows the resolution of these temporal expressions (which 
means transforming them into a concrete date, concrete interval or fuzzy 
interval). Both sets of rules were defined through an empirical study of 
a training corpus. The extension of the system, which makes the system 
able to work with multilingual texts, has been made in five stages. First, 
a direct translation of the Spanish temporal expressions in our knowledge 
database to the target language (English, Italian, French, Catalan, 
etc.) is performed. Each expression in the target language is linked to the 
same resolution rule used in the source language. The second stage is a 
search in Google for each expression, so that all those expressions for 
which no exact instances are found are eliminated. The third step is 
obtaining a set of keywords in the target language, which will be used 
to look for new temporal expressions in this language, learning new rules 
automatically. Finally, every new rule is linked with its resolution. 
In addition, we present two different kinds of evaluation. One of them measures 
the reliability of the system used for the automatic extraction of rules 
for new languages. In the other evaluation, the results of precision and 
recall in the recognition and resolution of Spanish temporal expressions 
are presented. 



1 Introduction 

Temporal information is among the most relevant data that can be obtained from texts 
in order to establish a chronology between the events of a text. Three 
different kinds of problems have to be dealt with when working with 
temporal expressions: 

* This paper has been supported by the Spanish government, projects FIT-150500- 
2002-244, FIT-150500-2002-416, TIC-2003-07158-C04-01 and TIC2000-0664-C02-02. 



R. Monroy et al. (Eds.): MICAI 2004, LNAI 2972, pp. 458-467, 2004. 
— Identification of temporal expressions 

— Resolution of the temporal expression 

— Event ordering 

At the moment there are different kinds of systems that try to annotate and 
resolve temporal expressions (TE) in different types of corpora: 

— Based on knowledge. These systems have a previous knowledge base that 
contains the rules used to solve the temporal expressions. 

— Based on Machine Learning. In this kind of system, a manually 
annotated corpus is needed to automatically generate the system rules, each of 
which can carry its frequency of appearance in the corpus. The system is 
based on these rules. 

Among the knowledge-based systems there are works like Wiebe et al. [6], 
which uses a set of defined rules. However, the corpora used in this system are 
scheduling dialogs, in which temporal expressions are limited. The system of 
Filatova and Hovy [4] is also knowledge-based and describes a method for breaking 
news stories into their constituent events by assigning time-stamps to them. The 
system of Schilder and Habel [9] is knowledge-based as well. However, it only 
resolves expressions that refer to the article date and not the ones that refer to 
a previous date in the text. By contrast, some of the most important systems 
based on Machine Learning are, for instance, Wilson et al. [5], Katz and Arosio 
[7], and Setzer and Gaizauskas [10]. The latter focuses on annotating Event-Event 
Temporal Relations in text, using a time-event graph which is more complete 
but costly and error-prone. 

Most of the systems described above try to resolve some of the problems 
related to temporal expressions, but not all of them. Moreover, they 
usually focus on one language and their adaptation to other languages is 
very complicated. For that reason, this work presents an approach to 
multilinguality that combines the two previous techniques, with the advantage 
that a manually annotated corpus is not needed for learning new rules in 
other languages. 

This work takes a knowledge-based monolingual temporal expression resolution 
system for Spanish, described in Saquete et al. [2], and turns it into a 
multilingual system based on machine learning. 

In order to assess the performance of the multilingual conversion 
independently of the performance of the resolution system, two different 
measurements have been made. On the one hand, the precision and recall of the new 
expressions obtained by means of automatic translation have been measured and, on the 
other hand, the precision and recall of the monolingual system applied to a Spanish 
corpus are presented. 

This paper has been structured in the following way: first of all, section 2 
shows a short description of the monolingual system that has been presented in 
other articles. Then, section 3 presents the general architecture and description 
of the multilingual system. In section 4, two different kinds of evaluation are 
presented. Finally, in section 5, some conclusions are shown. 
Fig. 1. Graphic representation of TERSEO 



2 Description of TERSEO System 

Figure 1 shows the graphic representation of the monolingual system proposed for 
the recognition of Spanish TEs and for the resolution of their references, 
according to the temporal model proposed. The texts are tagged with lexical 
and morphological information and this information is the input to the temporal 
parser. This temporal parser is implemented using a bottom-up technique (chart 
parser) and it is based on a temporal grammar. Once the parser recognizes the 
TEs in the text, these are introduced into the resolution unit, which will update 
the value of the reference according to the date it refers to and generate the 
XML tags for each expression. Finally, these tags are the input of an event ordering 
unit that gives back the ordered text. We can find explicit and implicit TEs. The 
grammar in Tables 1 and 2 is used by the parser to discriminate between them. 

There are two types of temporal references that should be treated: the time 
adverbs (e.g. yesterday, tomorrow) and the nominal phrases that refer to 
temporal relationships (e.g. the day after, the day before). In Table 2 we show 
some of the rules used for the detection of every kind of reference. 
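
As an illustration of how explicit-date rules of the kind listed in Table 1 might be recognised, the following minimal sketch encodes three of them as regular expressions. It is not the TERSEO chart parser: the use of regular expressions, the month and weekday word lists, and the names `EXPLICIT_RULES` and `find_explicit` are assumptions made for the example.

```python
import re

# Word lists are assumptions for this sketch; the original grammar works on tagged text.
MONTHS = ("enero|febrero|marzo|abril|mayo|junio|julio|agosto|"
          "septiembre|octubre|noviembre|diciembre")
WEEKDAYS = "lunes|martes|miércoles|jueves|viernes|sábado|domingo"

EXPLICIT_RULES = [
    # date -> dd/mm/(yy)yy
    ("date_numeric", re.compile(r"\b(\d{1,2})/(\d{1,2})/(\d{4}|\d{2})\b")),
    # date -> ('El') + weekday + dd 'de' month 'de' (yy)yy, weekday part optional
    ("date_textual", re.compile(
        rf"\b(?:(?:el\s+)?(?:{WEEKDAYS})\s+)?(\d{{1,2}})\s+de\s+({MONTHS})\s+de\s+(\d{{4}}|\d{{2}})\b",
        re.IGNORECASE)),
    # time -> hh:mm(:ss)
    ("time", re.compile(r"\b(\d{1,2}):(\d{2})(?::(\d{2}))?\b")),
]


def find_explicit(text):
    """Return (rule name, matched text) pairs for every explicit expression found."""
    hits = []
    for name, pattern in EXPLICIT_RULES:
        for match in pattern.finditer(text):
            hits.append((name, match.group(0)))
    return hits


print(find_explicit("El domingo 12 de junio de 1975 a las 10:30"))
print(find_explicit("Publicado el 12/06/1975"))
```

Implicit references such as those in Table 2 could be handled in the same way with a list of fixed phrases, but they additionally need a reference date before they can be resolved.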



3 Automatic Multilinguality 

In Figure 2 the graphic representation of the extension of TERSEO system is 
shown. The extension of the system consists of five main units: 
Table 1. Sample of rules for Explicit Dates Recognition 

date → dd+'/'+mm+'/'+(yy)yy 
       (12/06/1975) (06/12/1975) 
date → dd+'de'+mm+'de'+(yy)yy 
       (12 de junio de 1975) (12th of June of 1975) 
date → ('El')+diasemana+dd+'de'+mes+'de'+(yy)yy 
       (El domingo 12 de junio de 1975) (Sunday, 12th of June of 1975) 
time → hh+':'+mm+(':'+ss) 
       (time) 



Table 2. Sample of rules for Implicit Dates recognition 

Implicit dates referring to Document Date, Concrete: 
  reference → 'ayer' (yesterday) 
  reference → 'mañana' (tomorrow) 
  reference → 'anteayer' (the day before yesterday) 
  reference → 'el próximo día' (the next day) 

Implicit Dates, Previous Date, Period: 
  reference → 'un mes después' (a month later) 
  reference → num+'años después' (num years later) 

Implicit Dates, Previous Date, Concrete: 
  reference → 'un día antes' (a day before) 

Implicit Dates, Previous Date, Fuzzy: 
  reference → 'días después' (some days later) 
  reference → 'días antes' (some days before) 



— Translation Unit. This unit, using three translators (BabelFish¹, 
FreeTranslator² and Power Translator), makes an automatic translation of all the 
expressions in Spanish. 

— Temporal Expression Debugger. These translated expressions will be 
the input of the temporal expression debugger unit, which uses the Google 
resource to eliminate those expressions that have not been translated 
properly. 

— Keyword Unit. This unit obtains a set of keywords in the target language. 

— New Temporal Expression Searching Engine. This unit is able to 
learn new rules automatically using the keywords and an un-annotated 
corpus in the target language. 

— Resolution Linker. This unit links every new rule with its resolution. 



3.1 Translation Unit 

The translation unit is in charge of making a direct translation of the temporal 
expressions in Spanish from the TERSEO knowledge database to the target 
language (English, Italian, French, Catalan, etc.). For the translation, three machine 
translation systems have been used (BabelFish, FreeTranslator, PowerTranslator). 
The use of three translators is based on a well-known technique 
used in many multilingual systems [8]. This unit eliminates all those expressions 
that appear more than once after the translation. Besides, in this unit, each 
expression in the target language is linked to the same resolution rule used in 
the source language. 

¹ http://world.altavista.com/ 

² http://www.free-translator.com/translator3.html 

Fig. 2. Graphic representation of Automatic Multilingual Extension of TERSEO 



3.2 Temporal Expression Debugger 

This unit compares the different results obtained by the translation unit 
in order to obtain the best translation for every expression 
and avoid wrong translations, because it is possible that a translator gives a 
wrong translation of an expression as a result. 

All the translated expressions are the input of the Temporal Expression 
Debugger Unit, which uses Google³ as a resource to eliminate those expressions that 
have not been translated properly. Every expression is searched for in Google as an exact phrase, 
and the expression is considered wrong if Google does not return any match. 
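
The behaviour of the first two units can be sketched as follows. This is an illustrative outline only: `merge_translations` and `debug_expressions` are hypothetical names, and `exact_hits` stands for whatever exact-phrase search is available; no real search-engine API is assumed, so the example passes in a small lookup table instead.

```python
from typing import Callable, Iterable, List


def merge_translations(outputs: Iterable[Iterable[str]]) -> List[str]:
    """Pool the outputs of several translators, dropping repeated expressions."""
    merged: List[str] = []
    seen = set()
    for translator_output in outputs:
        for expression in translator_output:
            key = expression.strip().lower()
            if key and key not in seen:
                seen.add(key)
                merged.append(expression.strip())
    return merged


def debug_expressions(candidates: Iterable[str],
                      exact_hits: Callable[[str], int]) -> List[str]:
    """Keep only candidates for which the exact-phrase search returns at least one hit."""
    return [expression for expression in candidates if exact_hits(expression) > 0]


# Toy data: three translators, and a stand-in for the search engine.
outputs = [
    ["yesterday", "the past day"],
    ["yesterday", "it does NUM days"],
    ["Yesterday"],
]
fake_counts = {"yesterday": 1000, "the past day": 12}

candidates = merge_translations(outputs)
kept = debug_expressions(candidates, lambda e: fake_counts.get(e, 0))
print(candidates)
print(kept)
```

Passing the search function as a parameter keeps the filtering logic independent of any particular search service.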



3.3 Keyword Unit 

A third step is obtaining a set of keywords in the target language, which 
will be used to look for new temporal expressions in this language. In order to 
obtain keywords, this unit uses, on the one hand, all those expressions that have 
been previously translated to the target language and, on the other hand, the 
lexical resource WordNet. This resource is used in order to obtain synonymous 
keywords, increasing the set of words and the possibility of obtaining new temporal 
expressions using these words. 

³ http://www.google.com 

3.4 New Temporal Expression Searching Engine 

Using the temporal words obtained before, the new temporal expression searching 
engine accesses a corpus of texts in the target language, giving back new 
temporal expressions. These expressions are not related to any resolution rule 
at first. 



3.5 Resolution Linker 

The Resolution Linker assigns a resolution to each expression based on the char- 
acteristics of the temporal word or words that the expression contains. All these 
expressions are introduced in a knowledge database for the new language. The 
Resolution Linker increases this knowledge database with new rules found in the 
target language. Finally, this database will contain rules in different languages. 
For example, for the general rule: 

Day(Date)-1 / Month(Date) / Year(Date) 

there is a set of expressions in the source language (Spanish in this case) that are 
related to this rule in the knowledge database and a set of expressions in the 
target language (English in this case) that are related to the same resolution 
rule: 

— Source set: ayer, el pasado día, el día pasado, el último día, hace un día, 
hace NUM días, anteayer, anoche, de ayer, el pasado día NumDia, durante 
el día de ayer, durante todo el día de ayer 

— Target set: yesterday, the past day, the last day, for a day, it does NUM 
days, the day before yesterday, last night, of yesterday, during the day of 
yesterday, the passed day, last day, a day ago, does NUM days, day before 
yesterday, during all the day of yesterday, the day past, makes a day, makes 
NUM days 
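
The link between expressions in several languages and one shared resolution rule can be sketched as below. This is a simplified illustration, not the TERSEO knowledge database: the dictionary layout and the names `day_minus_one`, `KNOWLEDGE_BASE` and `resolve` are hypothetical, and only a few expressions from the sets above are included.

```python
from datetime import date, timedelta


def day_minus_one(reference: date) -> date:
    """General rule Day(Date)-1 / Month(Date) / Year(Date)."""
    return reference - timedelta(days=1)


# Expressions of both languages point at the same rule object.
KNOWLEDGE_BASE = {
    "es": {"ayer": day_minus_one, "el día pasado": day_minus_one},
    "en": {"yesterday": day_minus_one, "the last day": day_minus_one},
}


def resolve(expression: str, language: str, reference: date):
    """Resolve an expression against a reference date, or return None if no rule is linked."""
    rule = KNOWLEDGE_BASE.get(language, {}).get(expression.lower())
    return rule(reference) if rule else None


article_date = date(2004, 3, 1)
print(resolve("ayer", "es", article_date))
print(resolve("yesterday", "en", article_date))
```

Using date arithmetic rather than subtracting one from the day field handles month and year boundaries, which the compact rule notation leaves implicit.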



3.6 Integration 

Finally, it is necessary to integrate the TERSEO system with the automatic 
extension defined, in order to have a complete multilingual system. When the corpus 
is in the target language, the system will use the new knowledge database in this 
language to identify, resolve and order temporal expressions in this corpus. The 
complete system is shown in Figure 3. 
Fig. 3. Graphic representation of the integration of TERSEO and the automatic 
extension (Multilingual System) 

4 System Evaluation 

In this paper, two different kinds of evaluation are presented. First of all, the 
monolingual system TERSEO has been evaluated using two Spanish corpora, 
a training corpus and a test corpus. In addition, a second evaluation has been 
made in order to measure the precision and recall of the new rules obtained using 
automatic translation of the source knowledge database. These two evaluations 
have been made separately in order to measure how good the extension of the 
system is relative to the monolingual system. 



4.1 TERSEO Evaluation Using a Spanish Corpus 

In order to carry out an evaluation of the monolingual system, a manual 
annotation of texts has been made by two human annotators with the purpose of 
comparing it with the automatic annotation that the system produces. For that 
reason, it is necessary to confirm that the manual information is trustworthy 
and does not alter the results of the experiment. Carletta [3] explains that to 
assure a good annotation it is necessary to make a series of direct measurements, 
namely stability, reproducibility and precision, but in addition to these 
measurements the reliability must measure the amount of noise in the information. 
The author argues that, because the amount of agreement that one can 
expect by chance depends on the number and relative frequencies of the categories under 
test, the reliability of the category classifications should be measured 
Table 3. Evaluation of the monolingual system 

                TRAINING   TEST 
No. Art.        50         50 
Real Ref.       238        199 
Treated Ref.    201        156 
Successes       170        138 
Precision       84%        91% 
Recall          71%        73% 
Coverage        84%        80% 


using the factor kappa defined in Siegel and Castellan [11]. The factor kappa (k) 
measures the agreement among a set of annotators when they make 
category judgments. 

In our case, there is only one class of objects and there are three objects 
within this class: objects that refer to the date of the article, objects which refer 
to the previous date and objects that refer to another date different from the 
previous ones. 

After carrying out the calculation, a value k = 0.953 was obtained. According 
to the work of Carletta [3], a value of k such that 0.68 < k < 0.8 means 
that the conclusions are favorable, and k > 0.8 means that total reliability exists 
between the results of both annotators. Since our value of k is greater than 0.8, 
total reliability in the conducted annotation is guaranteed and, 
therefore, the results of obtained precision and recall are reliable. 
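
For readers unfamiliar with the statistic, the sketch below computes a chance-corrected agreement of the form K = (P(A) - P(E)) / (1 - P(E)) for two annotators. It assumes that the expected agreement P(E) is taken from category proportions pooled over both annotators, which is our understanding of the Siegel and Castellan formulation (Cohen's kappa would use per-annotator proportions instead). The function name and the toy labels are invented; the data are not those of the experiment.

```python
from collections import Counter


def kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators over the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)
    # P(A): proportion of items on which the annotators agree.
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    # P(E): chance agreement from category proportions pooled over both annotators.
    pooled = Counter(labels_a) + Counter(labels_b)
    expected = sum((count / (2 * n)) ** 2 for count in pooled.values())
    if expected == 1.0:
        raise ValueError("kappa is undefined when only one category is ever used")
    return (observed - expected) / (1 - expected)


# Three categories as in the text: article date, previous date, other date.
annotator_1 = ["article", "article", "previous", "other", "article", "previous"]
annotator_2 = ["article", "article", "previous", "other", "previous", "previous"]
print(round(kappa(annotator_1, annotator_2), 3))
```

On this toy input the annotators agree on five of six items, and the correction for chance lowers the score well below that raw proportion.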

An evaluation of the module of resolution of TEs was carried out. Two 
corpora formed by newspaper articles in Spanish were used. The first set has been 
used for training and it consists of 50 articles, manually annotated by the two 
annotators mentioned above. Thus, after making the appropriate adjustments to the 
system, the optimal results of precision and recall shown in 
Table 3 were obtained. The results in both corpora are very similar, but the scores for 
the test corpus are slightly higher because the number of temporal expressions 
was a bit smaller than in the training corpus. 

4.2 Evaluation of the Automatic Multilingual Extension of the 
System 

In this evaluation, the first two units of the automatic extension have been 
measured. The evaluation consisted of translating the temporal expressions into English. 
First of all, the translation unit, which has an input of 380 temporal expressions in 
Spanish, returns an output of 1140 temporal expressions in English. However, most of 
these expressions are identical because the translation is the same in the three 
automatic translators. Therefore, the translation unit deletes these duplicate 
expressions, and its output is 563 expressions. 

The debugger unit, which uses these expressions as input, returns 443 temporal 
expressions that are correct according to the Google search. These 
expressions have been checked manually in order to determine which ones are not 
correct. We consider two types of possible mistakes: 

— The translation of the expression is wrong 

— The resolution assigned to the expression is not correct 

Once these possible mistakes have been analyzed, 13 temporal expressions 
have been classified as incorrect. In conclusion, 430 temporal expressions are 
considered as properly translated and resolved. Considering recall as the number 
of correct translated expressions divided by the number of total expressions, and 
precision as the number of correct translated expressions divided by the number 
of translated expressions, the obtained results are: 

Recall = 430 / 563 = 0.76 (76%) 

Precision = 430 / 443 = 0.97 (97%) 

Some conclusions can be drawn from these results. First of all, a precision 
of 97% has been obtained from the direct translation of the temporal expressions. 
This means that the multilingual system has a precision of 81% for the training 
corpus and 88% for the test corpus, 3% less than the original 
results, which are successful values. 
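
The figures in this subsection can be re-derived with a few lines of arithmetic. The sketch below is only a check on the reported numbers; in particular, reading the 81% and 88% values as the product of the monolingual precision (Table 3) and the rounded translation precision is our interpretation, since the text does not state how they were computed.

```python
def percent(numerator, denominator):
    """Ratio expressed as a percentage."""
    return 100.0 * numerator / denominator


expressions_after_dedup = 563   # output of the translation unit
kept_by_debugger = 443          # expressions accepted after the Google check
judged_correct = 430            # 443 minus the 13 found incorrect by hand

recall = percent(judged_correct, expressions_after_dedup)
precision = percent(judged_correct, kept_by_debugger)

# Assumed reading: combined precision = monolingual precision x translation precision.
translation_precision = 0.97
combined_training = 100 * 0.84 * translation_precision
combined_test = 100 * 0.91 * translation_precision

print(f"recall {recall:.1f}%, precision {precision:.1f}%")
print(f"combined: training {combined_training:.1f}%, test {combined_test:.1f}%")
```

Under this reading the loss of about three percentage points comes entirely from the translation and debugging steps.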

5 Conclusions 

The approach implemented here combines the advantage of knowledge-based 
systems (high precision) with that of systems based on machine learning (ease 
of extension, in this case to other languages) without the disadvantage of needing 
a hand-annotated corpus to obtain new rules for the system. 

Moreover, a comparison of the two evaluations that have been performed 
shows that only 3% of precision is lost with the translated temporal expressions, 
so the results of the multilingual system remain quite high. 
Abstract. This paper describes AQUA, an experimental question 
answering system. AQUA combines Natural Language Processing (NLP), 
Ontologies, Logic, and Information Retrieval technologies in a uniform 
framework. AQUA makes intensive use of an ontology in several parts of 
the question answering system. The ontology is used in the refinement 
of the initial query, the reasoning process, and in the novel similarity 
algorithm. The similarity algorithm is a key feature of AQUA. It is used 
to find similarities between relations used in the translated query and 
relations in the ontological structures. 



1 Introduction 

The rise in popularity of the web has created a demand for services which help 
users to find relevant information quickly. One such service is question answering 
(QA), the technique of providing precise answers to specific questions. Given 
a question such as “which country had the highest inflation rate in 2002?” a 
keyword-based search engine such as Google might present the user with web 
pages from the Financial Times, whereas a QA system would attempt to directly 
answer the question with the name of a country. 

On the web, a typical example of a QA system is Jeeves¹ [1], which allows 
users to ask questions in natural language. It looks up the user's question in its 
own database and returns the list of matching questions which it knows how to 
answer. The user then selects the most appropriate entry in the list. Therefore, a 
reasonable aim for an automatic system is to provide textual answers instead of 
a set of documents. In this paper we present AQUA, a question answering system 
which amalgamates Natural Language Processing (NLP), Logic, Ontologies and 
Information Retrieval techniques in a uniform framework. 

The first instantiation of our ontology-driven Question Answering System, 
AQUA, is designed to answer questions about academic people and 
organizations. However, an important future target application of AQUA would be to 
answer questions posed within company intranets; for example, giving AQUA 
an ontology of computer systems might allow it to be used for trouble-shooting 
or configuration of computer systems. 

¹ http://www.ask.com/ 

AQUA is also designed to play an important role in the Semantic Web². One 
of the goals of the Semantic Web is the ability to annotate web resources with 
semantic content. These annotations can then be used by a reasoning system to 
provide intelligent services to users. AQUA would be able to perform incremental 
markup of home pages with semantic content. These annotations can be written 
in RDF [20,13] or RDFS [5], notations which provide a basic framework for 
expressing meta-data on the web. We envision that AQUA can perform the 
markup concurrently with looking for answers, that is, AQUA can annotate 
pages as it finds them. In this way then, semantically annotated web pages can 
be cached to reduce search and processing costs. 

The main contribution of AQUA is the intensive use of an ontology in several 
parts of the question answering system. The ontology is used in the refinement of 
the initial query (query reformulation), the reasoning process, and in the (novel) 
similarity algorithm. The last of these, the similarity algorithm, is a key feature 
of AQUA. It is used to find similarities between relations in the translated query 
and relations in the ontological structures. The similarities detected then allow 
the interchange of concepts or relations in the logic formulae. The ontology is 
used to provide an intelligent reformulation of the question, with the intent to 
reduce the chances of failure to answer the question. 

The paper is organized as follows: Section 2 describes the AQUA process 
model. Section 3 describes the Query Logic Language (QLL) used in the 
translation of the English written questions. Section 4 presents the query satisfaction 
algorithm used in AQUA. Section 5 describes the similarity algorithm embedded 
in AQUA. Section 6 shows output enhancements of the AQUA system. Section 
7 discusses related work. Finally, Section 8 gives conclusions and 
directions for future work. 

2 AQUA Process Model 

The AQUA process model generalizes other approaches by providing a frame- 
work which integrates NLP, Logic, Ontologies and information retrieval. Within 
this work we have focused on creating a process model for the AQUA system 
(Figure 1 shows the architecture of our AQUA system). 

In the process model there are four phases: user interaction, question pro- 
cessing, document processing and answer extraction. 

1. User interaction. The user inputs the question and validates the answer 
(indicates whether it is correct or not). This phase uses the following com- 
ponents: 

— Query interface. The user inputs a question (in English) using the user 
interface (a simple dialogue box). The user can reformulate the query if 
the answer is not satisfactory. 

² The goal of the Semantic Web is to help users or software agents to organize, locate 
and process content on the WWW. 

Fig. 1. The AQUA architecture (panels: User interaction, Question processing, 
Document processing) 



— Answer. A ranked set of answers is presented to the user. 

— Answer validation. The user gives feedback to AQUA by indicating agree- 
ment or disagreement with the answer. 

2. Question processing. Question processing is performed in order to 
understand the question asked by the user. This “understanding” of the question 
requires several steps such as parsing the question, representation of the 
question and classification. The question processing phase uses the following 
components: 

— NLP parser. This segments the sentence into subject, verb, prepositional 
phrases, adjectives and objects. The output of this module is the logic 
representation of the query. 

— Interpreter. This finds a logical proof of the query over the knowledge 
base using Unification and the Resolution algorithm [8]. 

— WordNet/Thesaurus. AQUA’s lexical resource. 

— Ontology. This currently contains people, organizations, research areas, 
projects, publications, technologies and events. 

— Failure-analysis system. This analyzes the failure of a given question and 
gives an explanation of why the query failed. Then the user can provide 
new information for the pending proof, and the proof can be re-started. 
This process can be repeated as needed. 

— Question classification & reformulation. This classifies questions as 
belonging to any of the types supported in AQUA (what, who, when, which, 
why and where). This classification is only performed if the proof failed. 
AQUA then tries to use an information retrieval approach. This means 
that AQUA has to perform the document processing and answer extraction 
phases. 
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3. Document Processing. A set of documents are selected and a set of para- 
graphs are extracted. This relies on the identification of the focus^ of the 
question. Document processing consists of two components: 

— Search query formulation. This transforms the original question, Q, using 
transformation rules into a new question Q’. Synonymous words can be 
used, punctuation symbols are removed, and words are stemmed. 

— Search engine. This searches the web for a set of documents using a set 
of keywords. 

4. Answer processing. In this phase answers are extracted from passages and
given a score, using two components:

— Passage selection. This extracts passages from the set of documents likely
to contain the answer.

— Answer selection. This clusters answers, scores answers (using a voting 
model), and lastly obtains a final ballot. 

3 Query Logic Language (QLL) 

In this section we present the Query Logic Language (QLL) used within AQUA
for the translation of the English question into its logical form. In QLL,
variables and predicates are assigned types. QLL also allows terms (in the
standard recursively defined Prolog sense [8,22]).

Like Prolog or OCML [25], QLL uses unification and resolution [22]. How- 
ever, in the future we plan to use Contextual Resolution [28]. Given a context, 
AQUA could then provide interpretation for sentences containing contextually 
dependent constructs. 

Again like Prolog, QLL uses the closed-world assumption, so facts that are not
“provable” are regarded as “false” as opposed to “unknown”. Future work needs
to be carried out in order to provide QLL with three-valued logic. Once
QLL becomes a three-valued logic language, the evaluation of a predicate
could produce yes, no or unknown, as in Fril [3]. Finally, QLL handles negation
as failure but it does not use cuts.
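
The difference between the closed-world reading and a possible three-valued reading can be illustrated with a small Python sketch. This is not AQUA's implementation (AQUA is written in Prolog and other languages); the fact tuples, the function names and the explicit set of known-false facts are all assumptions made for the illustration.

```python
# Toy knowledge base of ground facts (hypothetical example data).
FACTS = {("works_in", "alice", "akt"), ("project", "akt")}


def holds(fact):
    """Closed-world evaluation: a fact is true only if it can be found."""
    return fact in FACTS


def naf(fact):
    """Negation as failure: 'not fact' succeeds exactly when proving fact fails."""
    return not holds(fact)


def holds_three_valued(fact, known_false=frozenset()):
    """Three-valued variant: 'no' requires explicit negative knowledge."""
    if fact in FACTS:
        return "yes"
    if fact in known_false:
        return "no"
    return "unknown"
```

Under the closed-world reading the unprovable fact works_in(bob, akt) is simply false, whereas the three-valued variant reports it as unknown unless negative knowledge is supplied.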

AQUA uses QLL as an intermediate language, as shown in the example in
Section 4.

The translation rules are used when AQUA is creating the logical form of 
a query, i.e., from grammatical components into QLL. The set of translation 
rules we have devised is not intended to be complete, but it does handle all 
the grammatical components produced by our parser. The form of the logical 
predicates introduced by each syntax category is described in detail in [29]. 

4 The AQUA Query-Satisfaction Algorithm 

This section presents the main algorithm implemented in the AQUA system. 
For the sake of space, we present a condensed version of our algorithm. In this 

® Focus is a word or a sequence of words which defines the question and disambiguates 
it in the sense that it indicates what the question is looking for. 
following algorithm AQUA uses steps 1-4.1 to evaluate the query over the populated
AKT reference ontology. Steps 4.2 to 5 are used by AQUA to try to satisfy
the query using the Web as a resource.

1. Parse the question into its grammatical components such as subject, verb,
prepositional phrases, object and adjectives.

2. Use the ontology to convert from the QLL language to a standard predicate 
logic. The ontology is used by a pattern-matching algorithm ® to instantiate 
type variables, and allow them to be replaced with unary predicates. 

3. Re-write the logic formulae using our similarity algorithm (described in the 
next section) 

4. Evaluate/execute the re-written logic formulae over the knowledge base. 

— 4.1 If the logic formulae are satisfied then use them to provide an answer

— 4.2 else 

• Classify the question as one of the following types: 

* what - specification of objects, activity definition 

* who - person specification 

* when - date 

* which - specification of objects, attributes 

* why - justification of reasons 

* where - geographical location 

• Transform the query Q into a new query Q’ using the important 
keywords. 

• Launch a search engine such as Google ® with the new question Q’. 
AQUA will try to satisfy the user query using other resources such 
as the Web. 

• Analyze retrieved documents which satisfy the query Q’. 

• Perform passage extraction. 

• Perform answer selection. 

• Send answer to user for validation. 
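
The control flow of the steps above can be sketched in Python as follows. This is only an outline under stated assumptions: the callables `prove` and `web_search` are hypothetical stand-ins for the proof over the knowledge base and for the search engine plus answer extraction, and the keyword selection here simply drops the question word, which is cruder than selecting the important keywords.

```python
QUESTION_TYPES = ("what", "who", "when", "which", "why", "where")


def classify(question):
    """Return the first supported question word found in the question, or None."""
    for word in question.lower().split():
        token = word.strip("?,.")
        if token in QUESTION_TYPES:
            return token
    return None


def satisfy(question, prove, web_search):
    """Try the knowledge base first (steps 1-4.1), then fall back to the Web (step 4.2)."""
    answer = prove(question)  # stands in for parsing, translation and proof
    if answer is not None:
        return {"source": "knowledge base", "answer": answer}
    qtype = classify(question)  # classification happens only after a failed proof
    tokens = [w.strip("?,.") for w in question.lower().split()]
    keywords = [t for t in tokens if t and t not in QUESTION_TYPES]
    return {"source": "web", "type": qtype, "answer": web_search(keywords)}
```

For example, `satisfy("Who works in AKT?", lambda q: None, lambda kws: kws)` takes the Web branch because the hypothetical prover returns no answer.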

We can see from the algorithm that AQUA tries to satisfy a user query using 
several resources. Future implementations of AQUA could benefit from using 
the results obtained by the Armadillo^ [7] information extraction engine run- 
ning in the background as a complementary knowledge harvester. For instance, 
the user could ask the question What publications has Yorik Wilks produced? 
This is a good example of a query for which AQUA could make a request to 
Armadillo. Armadillo would find the publications of Yorik Wilks (for example, 

^ The AKT ontology contains classes and instances of people, organizations, research 
areas, publications, technologies and events 
(http://akt.open.ac.uk/ocml/domains/akt-support-ontology/) 

® The pattern-matching algorithm tries to find an exact match with names in the 
ontology. 

® http://www.google.com 

^ Armadillo is an information extraction engine which uses resources such as CiteSeer 
to find a limited range of information types such as publications. 
using CiteSeer® or his personal web site). Next, Armadillo would parse docu-
ments, retrieve the requested information and pass the set of publications back
to AQUA to render an appropriate answer. 

The example below shows just one aspect of the use of an ontology in question
answering: ‘ontology traversal’ (i.e. generalization/specialization). Suppose we
have the question: “Which technologies are used in AKT?” AQUA tries to answer
the question as follows.

The English question is translated into the QLL expression 
use(?x : type technology, akt : type ?y) 

which is then converted to the standard (Prolog-style) expression by using the 
AKT ontology 

$\exists X : \text{Domain} \ \ \text{technology}(X) \wedge \text{project}(\text{akt}) \wedge \text{use}(X, \text{akt}).$

Where the meaning of the quantifier is determined by the type of the variable
it binds. For our example, let us decide that Domain is the set containing the
union of each of the following congruence classes: people, projects, organiza-
tions, research areas, technologies, publications and events.

If there is a technology in the AKT project which is defined in the knowl-
edge base then X will be bound to the name of the technology. Let us imagine
the scenario where instances of technology are not defined in the AKT refer-
ence ontology. However, AQUA found in the AKT reference ontology that the
relation “commercial_technology” is a subclass of “technology”. Then commer-
cial-technology is a particular kind of technology, i.e.

$\text{commercial-technology} \subset \text{technology}$

By using the subsumption relation our initial formula is transformed into the 
following one: 

$\exists X : \text{commercial-technology} \ \ \text{commercial\_technology}(X) \wedge \text{project}(\text{akt}) \wedge \text{use}(X, \text{akt}).$

AQUA then tries to re-satisfy the new question over the knowledge base. 
This time the question succeeds with X instantiated to the name of one of the 
AKT technologies. 
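
The specialization step in this example can be sketched in Python. The dictionaries below are a toy stand-in for the ontology and knowledge base, and the instance name `tech_a` is invented for the illustration; only the idea of retrying the query on a subclass when the original class yields no answer is taken from the text.

```python
# Toy ontology and knowledge base (hypothetical data).
SUBCLASS_OF = {"commercial-technology": "technology"}  # child -> parent
INSTANCES = {"commercial-technology": {"tech_a"}}      # class -> instances
USES = {("tech_a", "akt")}                             # use(X, project) facts


def subclasses(cls):
    """Direct subclasses of cls according to the subsumption relation."""
    return [child for child, parent in SUBCLASS_OF.items() if parent == cls]


def used_in(project, cls):
    """Answer use(X, project) for X of type cls; if that fails, retry on subclasses."""
    found = {x for x in INSTANCES.get(cls, set()) if (x, project) in USES}
    if found:
        return found
    for sub in subclasses(cls):
        found |= used_in(project, sub)
    return found
```

Here `used_in("akt", "technology")` fails on the class technology, which has no instances, and succeeds after specializing to commercial-technology.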



5 Concept and Relation Similarity Algorithm 

The success of the attempt to satisfy a query depends on the existence of a good 
mapping between the names of the relations used in the query and the names of 
the relations used in the knowledge base/ontology. Therefore, we have embedded 
a similarity algorithm in AQUA. Our similarity algorithm uses both ontological 
structures and instances of the selected ontology, the Dice coefficient and the 
WordNet thesaurus. 



http://citeseer.nj.nec.com/cs

Our similarity algorithm differs from other similarity algorithms in that it
uses ontological structures and also instances. Instances provide evidential infor-
mation about the relations being analyzed. This distinguishes it from
the kind of similarity which can be achieved using only WordNet
or only distances to superclasses (Wu et al. [31]). In the former approach, WordNet
returns all the synsets found, even the ones which are not applicable to the
problem being solved. In the latter approach, a
common superclass between concepts is required.

For the sake of space, we present a brief explanation of our algorithm when 
arguments in the user query are grounded (instantiated terms) and they match 
exactly (at the level of strings) with instances in the ontology. A detailed de-
scription of the algorithm and an example can be found in [30].

The algorithm uses grounded terms in the user query. It tries to find them 
as instances in the ontology. Once they are located, a portion of the ontology 
(G2) is examined, including neighborhood classes. Then an intersection® (G3) 
between augmented query G1 and ontology G2 is performed to assess structural 
similarity. This is done using knowledge from the ontology. It may be the case 
that, in the intersection G3, several relations include the grounded arguments. 
In this case, the similarity measure is computed for all the relations (containing
elements of the user query) by using the Dice coefficient. Finally, the relation
with the maximum Dice coefficient value is selected as the most similar relation.
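
The final selection step can be sketched as below. The Dice coefficient of two sets A and B is 2|A ∩ B| / (|A| + |B|). The text does not say exactly which sets are compared, so this sketch assumes that each relation is described by a set of features (here, the classes of its arguments); the relation names and feature sets are hypothetical.

```python
def dice(a, b):
    """Dice coefficient of two collections treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))


def most_similar(query_features, candidates):
    """Return (relation name, score) for the candidate with the highest Dice value.

    candidates maps a relation name to its (assumed) feature set.
    """
    scored = ((name, dice(query_features, feats)) for name, feats in candidates.items())
    return max(scored, key=lambda pair: pair[1])


# Hypothetical candidate relations found in the intersection G3.
candidates = {
    "has-project-member": {"project", "person"},
    "has-author": {"publication", "person"},
}
best = most_similar({"person", "project"}, candidates)
```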

AQUA reformulates the query using the most similar relation and it then tries 
to prove the reformulated query. If no similarity is achieved using our similarity 
algorithm, AQUA presents the user with the synsets obtained from WordNet. 
From this offered set of synsets, the user can then select the most suitable one. 

6 Output Enhancements 

AQUA not only provides a set of elements which satisfy the query; it also en-
hances its answer using the information from the AKT reference ontology. It
provides more information about each element in the set. For instance, if the
query is “who works in AKT?” then AQUA brings additional contextual infor- 
mation such as AKT is a project at KMi and each person of the AKT team is a 
researcher at KMi. 

Validation of answers is a difficult task in general. Therefore, AQUA provides
a visualization of proofs to support the validation of answers. Another way
for AQUA to help users in validation could be to enhance answers with
extra information. There is ongoing work at KMi on this problem. We use Mag-
pie [10] in this task of enhancing answers. Magpie is a semantic browser which
brings information about ontological entities. For instance, it can retrieve home 
pages related to AKT or personal web pages of researchers working in the AKT 
project. All this extra information could be used by our AQUA users in answer 
validation. 

® Intersect means to find a portion of an ontology G2 which contains all concepts
contained in G1 (reformulated query) by applying a subsumption relation.

7 Related Work

There are many trends in question answering [16,17,18,27,4,24,14,15,2]; how-
ever, we only describe the systems most closely related to the AQUA system
philosophy.

MULDER is a web-based QA system [19] that extracts snippets called sum- 
maries and generates a list of candidate answers. However, unlike AQUA, the 
system does not exploit an inference mechanism, and so, for example, cannot 
use semantic relations from an ontology. 

QUANDA is closest to AQUA in spirit and functionality. QUANDA takes 
questions expressed in English and attempts to provide a short and concise 
answer (a noun phrase or a sentence) [6]. Like AQUA, QUANDA combines 
knowledge representation, information retrieval and natural language processing. 
A question is represented as a logic expression. Also knowledge representation 
techniques are used to represent questions and concepts. However, unlike AQUA, 
QUANDA does not use ontological relations. 

ONTOSEEK is an information retrieval system coupled with an ontology [12].
ONTOSEEK performs retrieval based on content instead of string-based re-
trieval. The target was information retrieval with the aim of improving recall
and precision and the focus was specific classes of information repositories: Yel- 
low Pages and product catalogues. The ONTOSEEK system provides interac- 
tive assistance in query formulation, generalization and specialization. Queries 
are represented as conceptual graphs, then according to the authors “the prob- 
lem is reduced to ontology-driven graph matching where individual nodes and 
arcs match if the ontology indicates that a subsumption relation holds between 
them”. These graphs are not constructed automatically. The ONTOSEEK team
developed a semi-automatic approach in which the user has to verify the links 
between different nodes in the graph via the designated user interface. 

8 Conclusions and Future Work 

In this paper we have presented AQUA - a question answering system which
amalgamates NLP, Logic, Information Retrieval techniques and Ontologies.
AQUA translates English questions into logical queries, expressed in a language,
QLL, that are then used to generate proofs. Currently AQUA is coupled with
the AKT reference ontology for the academic domain. In the near future, we 
plan to couple AQUA with other ontologies from our repertoire of ontologies. 

AQUA makes use of an inference engine which is based on the Resolution 
algorithm. However, in future it will be tested with the Contextual Resolution 
algorithm which will allow the carrying of context through several related ques- 
tions. 

We have also presented our similarity algorithm embedded in AQUA which 
uses Ontological structures, the Dice coefficient and WordNet synsets. This al- 
gorithm is used by AQUA to ensure that the question does not fail because of 

AQUA has been implemented in Sicstus Prolog, C, OCML and PHP. 
a mismatch between names of relations. Future work is intended to provide AQUA
with a library of similarity algorithms.

We will also explore the automatic extraction of inference rules, since knowl-
edge about inference relations between natural language expressions is very im-
portant for the question answering problem.

Abstract. Phrase chunking can be an effective way to enhance the per-
formance of an existing parser in a machine translation system. This paper
presents a Chinese phrase chunker implemented using a transformation-
based learning algorithm, and an interface devised to convey the de-
pendency information found by the chunker to the parser. The chunker
operates as a preprocessor to the parser in a Chinese to Korean machine
translation system currently under active development. By introducing
a chunking module, some of the unlikely dependencies can be ruled out
in advance, resulting in noticeable improvements in the parser’s perfor-
mance.



1 Introduction 

Traditional natural language parsers aim to produce a single grammatical struc- 
ture for the whole sentence. This full parsing approach has not been generally 
successful in achieving the accuracy, efficiency and robustness required by real 
world applications. Therefore several partial parsing or shallow parsing tech- 
niques have been proposed, where the input sentence is partitioned into a se- 
quence of non-overlapping chunks. Abney [1] introduced the notion of chunks, 
which denote meaningful sequences of words typically consisting of a content 
word surrounded by zero or more function words. Subsequent works have pro- 
posed various chunking techniques, mostly by using machine learning methods 
([2], [3], [4], [5], [6], [7]). The CoNLL-2000 shared text chunking task [8] provided 
an opportunity to compare them with the same corpus data. 

Chunkers or partial parsing systems are able to produce a certain level of 
grammatical information without undertaking the complexity of full parsers, and 
have been argued to be useful for many large-scale natural language processing 
applications such as information extraction and lexical knowledge acquisition 
([9], [10], [11]). However, chunkers cannot be easily integrated into most practical 
machine translation systems. Many existing machine translation systems are 
based on the transfer model, where the transfer modules usually assume fully 
parsed structures as their input. Moreover, the output of partial parsers does not
generally provide grammatical relations among words within chunks.

In this paper, we adopt chunking as a preprocessing step to full parsing. 
The chunking module attempts to identify useful chunks in input sentences and 
then provides the chunk information to the next stage by restricting the depen-
dency relations among words. This way the chunking module can be easily
integrated into the conventional parsing system while considerably improving
its performance.



2 Phrase Chunking Based on Transformation-Based 
Learning 

The target application environment of this study is an ongoing project on a
Chinese to Korean machine translation system, whose objective is to provide real-
time online translation of general-domain web documents. The phrase chunking
module is used to recognize useful chunks in Chinese text and operates as a
preprocessor to the full-scale Chinese parser.



2.1 Phrase Chunk Types 

Chunks are non-recursive and non-overlapping groups of continuous words. The 
following example shows a Chinese sentence (represented in Pinyin codes) whose 
chunks are marked by brackets with their chunk types. 

[NP gōng'ān xiāofáng “119” jiējǐngtái] [NP língchén] [QP 2 shí 43 fēn]
[VP jiēdào] [NP huǒjǐng] , [NP 9 bù xiāofángchē] [VP jíshí gǎndào] [NP
huǒzāi xiànchǎng] , [QP 3 shí 30 fēn] [NP dàhuǒ] [VP bèi pūmiè] .

(The “119” public fire department received a call at 2:43 early morning,
9 fire engines were immediately dispatched to the scene, and the big fire
was fully under control at 3:30.)

The chunk types are based on the phrase tags used in parse trees, and the
names of the phrase tags used in this study are similar to those in the Penn Chinese
Treebank [12]. For example, the chunk type NP corresponds to the phrase label
NP. However, since chunks are non-recursive, a phrase in a parse tree will
typically be broken into several chunks, and higher-level phrase tags
such as S will not have corresponding chunk types. Note that it is also possible
that some words do not belong to any chunk.

The list of Chinese chunk types this study attempts to identify is as follows: 

— ADJP : Adjectival Phrases 

— ADVP : Adverbial Phrases

— NP : Noun Phrases 

— QP : Quantifier Phrases 

— VP : Verbal Phrases 

There is always one syntactic head in each chunk: AJ (adjective) in ADJP, NN 
(noun) in NP, and so on. A chunk normally contains premodifiers to the left of 
its head, but not postmodifiers or arguments. Therefore, the syntactic head is 
the last element of the chunk in most cases. The exceptional cases are when a
chunk is enclosed by a pair of matching parentheses or quotation marks. These
symbols are regarded as part of the chunks. Some of the VP chunks can also
be exceptional if two consecutive verbs form verb-resultative or verb-directional
compounds where the second verbs indicate the result or direction of the first
verbs. In such cases, the head can be at a position other than last. In the case of
coordination, the last conjoined elements are assumed to be the syntactic heads.

QP represents quantifier phrases consisting of determiners or numbers fol-
lowed by measure words. A QP can be contained in an NP chunk as a premodifier
like the QP “9 bù” in [NP 9 bù xiāofángchē] (9 fire engines) in the previous ex-
ample, but a QP is considered as a chunk of its own in other cases.

Note that no chunks of type PP (prepositional phrases), for example, will be 
recognized. As a constituent, a prepositional phrase will consist of a preposition 
followed by its object phrase. Because the object phrase itself will form one 
or more chunk phrases, the PP chunks will almost always consist of one word 
which is the preposition. We chose not to include a chunk type if the chunks 
of the type consist of only one word, because the main objective of chunking in 
this study is to assist parsing and these one-word chunks will provide no useful 
dependency constraints. The same is true for many other phrases such as LCP 
(phrase headed by localizer LC), CP (relative clause headed by complementizer 
DE), and so on. 

The chunk-annotated corpuses that were used in the experiments were au-
tomatically constructed by extracting chunks from parse trees. For example, an
NP chunk may include the syntactic head and all of its premodifiers. But if one 
of the premodifiers has its own NP or VP constituents, then the chunk should 
be broken into several smaller chunks, and some words may end up not belong- 
ing to any chunks. This automatic conversion process may introduce a certain 
percentage of errors, but it will not be a serious problem for a learning-based 
approach. Currently no chunk-annotated corpus of enough size is available. 

2.2 Transformation-Based Learning 

The phrase chunking system in this study was developed using the transformation-
based learning framework first proposed by Brill [13]. It is basically a non-
statistical corpus-based learning technique, where an ordered set of transfor- 
mation rules is acquired from an annotated training corpus. The idea is to start 
with some simple solution to the problem and apply transformations iteratively, 
selecting at each step the transformation which results in the largest benefit. 

The chunk tags were encoded in the CoNLL-2000 style representation [8] (origi-
nally from [5]) where chunk tags {I, O, B} are combined with chunk types. For
example, B-NP denotes an initial word of an NP chunk, I-VP a non-initial word
of a VP chunk and O a word outside any chunk.
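
As an illustration of this encoding, the following Python sketch converts a bracketed chunk sequence into per-word tags. The input representation (a list of pairs of chunk type and words, with None for words outside any chunk) and the function name are assumptions made for the example.

```python
def to_iob(chunks):
    """Convert [(chunk_type or None, [words])] into a list of (word, tag) pairs."""
    tagged = []
    for chunk_type, words in chunks:
        for position, word in enumerate(words):
            if chunk_type is None:
                tagged.append((word, "O"))  # word outside any chunk
            elif position == 0:
                tagged.append((word, "B-" + chunk_type))  # chunk-initial word
            else:
                tagged.append((word, "I-" + chunk_type))  # non-initial word
    return tagged


# Fragment of the example sentence from Section 2.1 (tone marks omitted).
example = [
    ("NP", ["huozai", "xianchang"]),
    (None, [","]),
    ("QP", ["3", "shi", "30", "fen"]),
]
tags = to_iob(example)
```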

The words in training corpus are annotated with these chunk tags as well as 
the part of speech tags. The learning process proceeds as follows. First, some 
simple baseline algorithm assigns initial chunk tags to the words in training 
corpus. One way to do this is to assign chunk tag Y to every word which has 
POS tag X, if Y was the most frequent chunk tag of POS tag X in the training
corpus. After the baseline assignment, all transformations that could
possibly improve the current chunk tag assignment of the training corpus are
inspected. A rule that corrects a wrong chunk tag in one place of the training
corpus may also change correct tags into wrong ones in other places. Therefore,
the net contribution of each candidate rule should be computed. The one with the
largest benefit is selected, applied to the corpus and added to the sequence of
learned transformation rules. The entire process is then repeated on the updated
corpus, until there are no more transformations or the net contribution of the
selected transformation is smaller than a pre-defined threshold.
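
The learning loop described above can be sketched in a few lines of Python. This is a deliberately reduced illustration, not the toolkit used in the experiments: it assumes a single rule template (current chunk tag and POS tag of the preceding word), which is much simpler than the templates used in this study, and all function names are hypothetical.

```python
from collections import Counter, defaultdict


def baseline(pos_tags, gold):
    """Assign to each word the chunk tag most frequent for its POS tag."""
    by_pos = defaultdict(Counter)
    for pos, tag in zip(pos_tags, gold):
        by_pos[pos][tag] += 1
    return [by_pos[pos].most_common(1)[0][0] for pos in pos_tags]


def apply_rule(rule, pos_tags, tags):
    """Rule (cur, prev_pos, new): change cur to new when the previous POS is prev_pos."""
    cur, prev_pos, new = rule
    out = list(tags)
    for i in range(1, len(tags)):
        if tags[i] == cur and pos_tags[i - 1] == prev_pos:
            out[i] = new
    return out


def learn(pos_tags, gold, threshold=1):
    """Greedily collect rules until the best net gain falls below the threshold."""
    tags = baseline(pos_tags, gold)
    rules = []
    while True:
        correct = sum(t == g for t, g in zip(tags, gold))
        # Instantiate the template at every position that is currently wrong.
        candidates = {(tags[i], pos_tags[i - 1], gold[i])
                      for i in range(1, len(tags)) if tags[i] != gold[i]}
        best, best_gain = None, 0
        for rule in sorted(candidates):
            new_tags = apply_rule(rule, pos_tags, tags)
            # Net contribution: corrections minus newly introduced errors.
            gain = sum(t == g for t, g in zip(new_tags, gold)) - correct
            if gain > best_gain:
                best, best_gain = rule, gain
        if best is None or best_gain < threshold:
            return rules, tags
        tags = apply_rule(best, pos_tags, tags)
        rules.append(best)
```

Because every accepted rule strictly increases the number of correct tags, the loop is guaranteed to terminate.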

In order to generate possible transformations at each step of the learning
process, a set of rule templates is used to describe the kind of rules the system
will try to learn. The templates determine which features are checked when
a transformation is proposed. For example, the following
template defines a rule that will change the chunk tag based on the
current chunk tag and the POS tags of the three preceding words.

chunk_0 pos_-3 pos_-2 pos_-1 => chunk

The candidate transformation rules are then generated by instantiating the tem- 
plates with actual values from the corpus. The features used in this study are as 
follows. 

— neighboring words 

— POS tags of neighboring words 

— chunk tags of neighboring words 

In the experiments, the neighboring context was limited to the range [-3, +3],
that is, the features of at most three words to the left or to the right can be 
seen by the transformation rules. The context range defines the size of the search 
space for candidate rules, and therefore the amount of computation for learning 
is heavily dependent on the context size. Some rules also check the presence of 
a particular feature in a certain range. These basic features are systematically 
combined to produce rule templates. 

2.3 Experimental Result 

A Chinese phrase chunking system was implemented using the Fast Transforma- 
tion-Based Learning Toolkit [14]. A total of 150 rule templates were devised that 
use the features explained above. Since no chunk-annotated Chinese corpus of
sufficient size is currently available, we converted the parse trees generated by our
parser into a training corpus. Though the parse trees contain a certain percentage
of errors, we could still expect to get a reasonably useful result from the learning
process. A small-scale experimental comparison confirmed this assumption. For
this purpose, two different sets of small-sized corpuses were prepared: Mtrain 
(45,684 tokens) and Mtest (10,626 tokens) were made from the tree-annotated 
corpus manually corrected by native Chinese speakers, while Atrain (45,848
tokens) and Atest (10,757 tokens) were directly converted from the parse trees
generated by the parser. The stopping threshold for the learning process was one,
that is, the algorithm stopped when a rule with net score one was reached.

Table 1. Chunking result (pilot version)

Training corpus   Test corpus   Precision   Recall   $F_{\beta=1}$
Mtrain            Mtest         94.88       93.91    94.39
Atrain            Atest         93.79       92.08    92.93
Atrain            Mtest         95.07       93.91    94.49

Table 2. Chunking result

            Precision   Recall   $F_{\beta=1}$
Baseline    40.64       63.53    49.56
Final       95.75       94.13    94.93

Table 1 shows the results measured with the usual precision, recall and
$F_{\beta=1}$ rates. The first result was obtained with manually corrected training and test
corpuses, while the second one was with the corpuses from the ‘raw’ parse trees. 
The numbers of acquired rules were 353 and 399, respectively. In the third exper- 
iment, the rules acquired from Atrain were applied to Mtest. Therefore, in the 
first and third experiments, the same Mtest was used to compare the transfor- 
mation rules acquired from Mtrain and Atrain. The result does not show any 
serious performance drop when using the possibly inferior Atrain as training 
corpus. In fact, contrary to expectation, the rules acquired from Atrain per- 
formed slightly better on the Mtest corpus than the rules from Mtrain. The fact 
that considerably more rules were acquired from Atrain than Mtrain may be 
a factor here, but the difference does not seem significant. However, the result 
does suggest that the Mtest corpus is somehow ‘easier’ than Atest from the 
viewpoint of chunking. 

For a more extensive experiment, a training corpus of 157,835 tokens and 
a test corpus of 45,879 tokens were used. Both corpuses consist of sentences 
gathered from various Chinese websites, and were automatically converted from 
the parse trees generated by the parser. The same 150 templates were used and 
the stopping threshold this time was three. The number of acquired rules was 
303. Table 2 shows the baseline and final result of this experiment. 
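
For reference, the F rate with β = 1 is the harmonic mean of precision and recall, F = 2PR / (P + R). The short Python sketch below recomputes it from the precision and recall values reported in Tables 1 and 2; the function and variable names are our own, and small differences in the last digit are to be expected because the inputs are themselves rounded.

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (beta = 1 by default)."""
    b2 = beta * beta
    denominator = b2 * precision + recall
    if denominator == 0:
        return 0.0
    return (1 + b2) * precision * recall / denominator


# (precision, recall, reported F) taken from Tables 1 and 2.
reported = {
    "Mtrain/Mtest": (94.88, 93.91, 94.39),
    "Atrain/Atest": (93.79, 92.08, 92.93),
    "Atrain/Mtest": (95.07, 93.91, 94.49),
    "Baseline": (40.64, 63.53, 49.56),
    "Final": (95.75, 94.13, 94.93),
}

for name, (p, r, f) in reported.items():
    print(name, round(f_measure(p, r), 2), "reported:", f)
```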

The result shows an improvement of about 2% in F-measure over the earlier
Atest experiment with the smaller training corpus. This result is slightly better than
the published result of the CoNLL-2000 shared chunking task [8], but it should be
noted that the results cannot be directly compared since the target domain and
experimental details are totally different. Moreover, the formulation of the problem
itself seems to be relatively easier in our case. For one thing, the number of chunk
types to be identified is smaller, because the target application of this study is 
more restricted. 




Phrase Chunking for Efficient Parsing in Machine Translation System 






[Figure: structure of the parsing system, from input sentence to output parse tree]

Fig. 1. Structure of parsing system

3 From Chunking to Parsing

Figure 1 shows the structure of the Chinese analysis system in our machine
translation system. In order to handle the excessive length of real-world
sentences, the system attempts to split the sentence into a sequence of clauses
based on punctuation marks and other context information. After that, chunks
are identified for each clause and the relevant information is handed over to
the chart-based parser module. If more than one tree is produced for a clause, the
tree selection module selects one of them based on various knowledge sources.

3.1 Chunks and Dependency Information 

The parser can benefit from the information obtained by the phrase chunker. However,
most machine learning-based chunking processes identify only the start and end 
positions of the chunks, without their internal structures. Therefore, it is not 
possible to use the chunks directly in parsing. We devised an interface that 
effectively conveys the dependency information made available by chunking to 
the parser. 

The dependency relation of a sentence can be defined as an $n \times n$ matrix Dep, where
$n$ is the number of words in the sentence. Dep[i,j] represents the dependency
between the $i$-th and $j$-th words. If the $i$-th word is a dependent (that is, a complement or
a modifier) of the $k$-th word, Dep[i,k] = 1. This implies Dep[i,j] = 0 for all $j \neq k$, since
a word cannot have more than one governor in our grammar. If an identified
chunk corresponds to a constituent in a parse tree, then the words inside the
chunk except its head are not allowed to have a dependency relation with any word
outside the chunk. More precisely, if a sentence has $n$ words, $w_1, \ldots, w_n$, and a
chunk C[c : k] spanning from $w_c$ to $w_k$ with its head $w_h$, $1 \le c \le h \le k \le n$, is
found, then Dep[i,j] = 0 for all $i$, $c \le i \le k$, $i \neq h$, and for all $j$, $1 \le j < c$, $k <
j \le n$. Figure 2 shows the status of the dependency matrix Dep when the chunk
is recognized. The cells marked zero mean that there will be no dependency relation
between the two words. This matrix is consulted during parsing to prevent pairs
of structures with disabled dependency from being combined.

[Figure: the $n \times n$ dependency matrix Dep with rows and columns indexed 1, ..., c, ..., n; cells are marked 0, ? or -]

Fig. 2. Dependency matrix

The syntactic head of a chunk can be found in most cases by inspecting the 
POS tag sequence. The head of an NP chunk should be the last noun element in 
the chunk, the head of a QP chunk should be the last measure word (MW), and 
so on. Therefore, if a chunk corresponding to a constituent is recognized and 
its syntactic head is known, the dependency constraints can be automatically 
constructed. 
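
Locating the head from the POS tag sequence can be sketched as follows. The mapping for NP, QP and ADJP follows the text; the entry for ADVP (head tag AD) and the tag CD used in the usage example are assumptions, and VP chunks are deliberately left out because they are discussed separately below.

```python
# Head POS tag per chunk type. NP, QP and ADJP follow the text;
# the ADVP entry is an assumption made for this sketch.
HEAD_POS = {"NP": "NN", "QP": "MW", "ADJP": "AJ", "ADVP": "AD"}


def find_head(chunk_type, pos_tags, start):
    """Return the sentence index of the last word carrying the head POS tag.

    pos_tags holds the POS tags of the words in the chunk and start is the
    sentence index of the first word. Returns None if no head is found.
    """
    target = HEAD_POS.get(chunk_type)
    if target is None:
        return None
    for offset in range(len(pos_tags) - 1, -1, -1):
        if pos_tags[offset] == target:
            return start + offset
    return None
```

For an NP chunk tagged CD MW NN that starts at sentence position 5, `find_head("NP", ["CD", "MW", "NN"], 5)` selects the noun at position 7.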

However, there are problematic cases such as VP chunks with more than
one verb element. The simple method explained above cannot be applied to
these VP chunks for two reasons. First, VP chunks do not always correspond 
to constituents. For example, a transitive verb may take a noun phrase as its 
object, which will become an NP chunk of its own. So, the VP constituent in 
this case will be broken into two distinct chunks: VP and NP. The VP chunks 
often consist of the remaining elements of verb phrases from which other chunks 
are extracted, and therefore do not generally correspond to any constituents. 
If a chunk does not correspond to a constituent, the non-head elements of the 
chunk can have dependency relations with words outside the chunk. Secondly,
for the VP chunks with more than one verb, the syntactic head cannot be
determined just by inspecting the POS tag sequence because verbs can have
post-modifiers. For example, in a VP chunk with tag sequence “AD VV VV”, the
first verb could be an adverbial modifier to the second one, or the second verb
could be a directional or resultative modifier to the first one. For the VP chunks
with more than one verb element, the above two reasons make it impossible to
construct dependency constraints such as those in Fig. 2.

The first problem can be avoided if we stipulate that VP chunks should correspond
to constituents. This approach is undesirable and inappropriate for shallow
parsing applications because it makes the chunk definition recursive, resulting
in a loss of efficiency. The second problem can also be avoided if we split
verb-resultative or verb-directional constructions into two separate VP chunks. Then,
the syntactic heads of VP chunks will be uniquely determined by the tag sequence.
However, since this verb-modifier construction is highly productive and regarded
as a lexical process in Chinese, it is not desirable to recognize them as separate
chunks.

For each sentence [1..n] do the following:

1. Initialize dependency matrix Dep;

2. Apply chunking rules;

3. For each recognized chunk C[c..k] do the following:

   if its chunk type C ≠ VP, or C = VP and C has only one verb then

   - Locate syntactic head h of C;

   - for all i, c ≤ i ≤ k, i ≠ h, and for all j, 1 ≤ j < c, k < j ≤ n do Dep[i,j] = 0;

   else (that is, C = VP and C has more than one verb)

   - for all i, c ≤ i ≤ k, TAG(w_i) ≠ VV, and for all j, 1 ≤ j < c, k < j ≤ n do
     Dep[i,j] = 0.

Fig. 3. The phrase chunking algorithm

It might seem that these VP chunks are of no use for the purpose of this
study. However, some of the dependency constraints, in a restricted form, can still be
obtained for VP chunks with more than one verb element. For the VP chunk
“AD VV VV”, the head of the first word AD should be found inside the chunk.
Therefore, the dependency relation between this adverb and all the words
outside the chunk can be safely set to zero. More generally, for a VP chunk
with more than one verbal element, for all non-verbal elements w_i of the chunk,
Dep[i,j] can be set to zero for all w_j outside the chunk.

In summary, the phrase chunker works as follows. After the input sentence
goes through the morphological analysis and clause splitting stages, the
chunking rules are applied and a set of chunks is recognized. For each chunk thus
obtained, the dependency matrix Dep is updated according to its chunk type.
This procedure is summarized in Fig. 3.
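The update of Dep described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the use of None for an undecided cell, and the 1-based `tags` dictionary are hypothetical and are not taken from the system described in the paper.

```python
def init_dep(n):
    """Return an (n+1) x (n+1) matrix of None ("undecided"); index 0 is unused."""
    return [[None] * (n + 1) for _ in range(n + 1)]


def apply_chunk(dep, n, c, k, tags, chunk_type, head=None):
    """Set Dep[i][j] = 0 for restricted words i in chunk [c..k] and all j outside it.

    tags maps 1-based word positions to POS tags (assumed representation).
    head is the head position; it is ignored for VP chunks with several verbs.
    """
    verbs = [i for i in range(c, k + 1) if tags[i] == "VV"]
    if chunk_type == "VP" and len(verbs) > 1:
        # Head unknown: only the non-verbal elements are restricted.
        restricted = [i for i in range(c, k + 1) if tags[i] != "VV"]
    else:
        # Head known: every non-head element is restricted.
        restricted = [i for i in range(c, k + 1) if i != head]
    outside = [j for j in range(1, n + 1) if j < c or j > k]
    for i in restricted:
        for j in outside:
            dep[i][j] = 0
    return dep


# Toy sentence of 5 words with a VP chunk "AD VV VV" at positions 2..4.
tags = {1: "NN", 2: "AD", 3: "VV", 4: "VV", 5: "NN"}
dep = apply_chunk(init_dep(5), 5, 2, 4, tags, "VP")
```

In the toy sentence only the adverb at position 2 is cut off from the words outside the chunk; both verbs keep all their cells undecided, which matches the restricted constraint described for multi-verb VP chunks.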



Table 3. Parsing result

                               Without chunking   With chunking   Difference
Accuracy                       90.35%             90.87%          +0.52%
Avg # of trees per clause      6.1                3.8             -37.8%
# of parsed words per second   866.3              1012.8          +16.9%
Avg memory usage per clause    237.2 KB           184.0 KB        -22.4%



3.2 Experimental Result 

The phrase chunking module was inserted into the Chinese-Korean machine
translation system right before the parser, and an experiment was performed
with a test corpus (10,160 tokens) which was previously unseen by the chunking
algorithm. Table 3 compares the results of parsing with and without the chunking
module. With the chunking module added, the average number of trees per clause
dropped from 6.1 to 3.8, and considerable improvement in both speed and memory usage was
obtained. The speed measured by the number of parsed words per second
increased by about 17 percent, and the average size of the total memory allocated
by the parser to parse a clause decreased by about 22 percent. The accuracy
was calculated as the ratio of the right dependency links to the total number
of links, by comparing the parsed trees to the correct trees manually prepared
by native Chinese speakers. A sentence of n words has n dependency links
(including the root of a tree, which should have no governor), and for each word
in a parsed tree, the link was counted as right if it has the same governor as in
the corresponding correct tree. The result with the chunking module showed about
0.5% improvement, which is a relatively small gain. We assume this is
because the tree selection module currently employed in the parser is relatively
reliable. Still, the improvement in efficiency alone makes the chunking module a
worthwhile addition to the existing machine translation system.
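As a worked illustration of the accuracy measure and of the Difference column in Table 3, the following Python sketch computes both. The governor-list encoding (one governor index per word, 0 for the root) and the function names are assumptions made for this example, not the evaluation code used in the experiment.

```python
def link_accuracy(parsed_governors, gold_governors):
    """Fraction of words whose governor matches the correct tree.

    Each list holds one governor index per word; 0 marks the root
    (an assumed encoding).
    """
    if not gold_governors or len(parsed_governors) != len(gold_governors):
        raise ValueError("both trees must cover the same non-empty sentence")
    right = sum(p == g for p, g in zip(parsed_governors, gold_governors))
    return right / len(gold_governors)


def relative_change(before, after):
    """Relative change from 'before' to 'after', e.g. 0.17 for +17%."""
    return (after - before) / before


# Toy 4-word sentence: the last word is attached to the wrong governor.
gold = [2, 0, 2, 3]
parsed = [2, 0, 2, 2]
accuracy = link_accuracy(parsed, gold)

# Speed and memory figures taken from Table 3.
speed_gain = relative_change(866.3, 1012.8)
memory_gain = relative_change(237.2, 184.0)
```

Note that the accuracy row of Table 3 reports an absolute difference in percentage points, while the other rows report relative changes.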

4 Conclusion 

This paper presented an approach to phrase chunking as a way to improve the
performance of an existing parser in a machine translation system. In order to
accomplish this, an appropriate interface between the chunker and the parser is
necessary so that the chunk information obtained by the chunker can be
effectively used by the parser. We have implemented a Chinese phrase chunker using
a transformation-based learning algorithm, and integrated it into the
Chinese-Korean machine translation system. The experimental results are promising,
with improvements in parsing efficiency and accuracy. In particular, considering the
excessive complexity of real-world sentences that a practical general-domain
natural language processing system should cope with, the noticeable improvement
in parsing efficiency could be quite useful in implementing practical applications.



Abstract. This paper describes the foundations and algorithms of a new
alternative to improve the connectivity of the configuration space using
probabilistic roadmap methods (PRM). The main idea is to use some
geometric features of the obstacles and the robot in the workspace
to obtain useful configurations close to the obstacles and thus find
collision-free paths. To improve the performance of these planners, we
use the “straightness” feature and propose a new heuristic which allows
us to solve narrow corridor problems. We apply this technique to
motion planning problems for a specific kind of robot, “free
flying objects” with six degrees of freedom (dof), three used for
position and three used for orientation. We have implemented
the method in three-dimensional space and we show results which
indicate that some geometric features of the workspace can be
used to improve the connectivity of the configuration space and to solve
narrow corridor problems.

Keywords: Probabilistic roadmap methods, motion planning, geometric 
features, robotics. 



1 Introduction 

We present a new planning method which computes collision-free paths for robots
of virtually any type moving among stationary obstacles (static workspace).
However, our method is particularly interesting for robots with six dof. Indeed,
an increasing number of practical problems involve such robots. The main
contribution of this proposal is a heuristic which attempts to compute useful
configurations close to the obstacles in order to improve the connectivity of
the PRMs. A basic concept in probabilistic roadmap methods is that of a
configuration: a tuple with a specific number of parameters which determine
the position and orientation of an object. The method proceeds in two phases:
a construction phase and a query phase. In the construction phase a
probabilistic roadmap is constructed by repeatedly generating random free
configurations of the robot (using spheres which surround the robot and the
obstacles) and connecting these configurations using a simple but very fast
local planner. The roadmap thus formed in the
free configuration space (C-space [12]) of the robot is stored as an undirected
graph R. The configurations are the nodes of R and the paths computed by
the local planner are the edges of R. The construction phase is completed
by some postprocessing of R to improve its connectivity. This improvement is
computed using the “straightness” geometric feature of the workspace. This
feature indicates the direction along which the object presents its long side.
Following the construction phase, multiple queries can be answered. A query
asks for a path between two free configurations of the robot. To process a query
the method first attempts to find a path from the start and goal configurations
to two nodes of the roadmap. Next, a graph search is done to find a sequence of
edges connecting these nodes in the roadmap. Concatenation of the successive
path segments transforms the sequence found into a feasible path for the robot.

2 Related Work 

Probabilistic roadmap planners (PRMs) construct a network of simple paths
(usually straight paths in the configuration space) connecting collision-free
configurations picked at random [1,3,4,7,15,8,10,14,13]. They have been successful
in solving difficult path planning problems.

An important issue in PRM planners is the method for choosing the random
configurations for the construction of the roadmaps. Recent works have
considered many alternatives to a uniform random distribution of configurations as
a means of dealing with the narrow passage problem. A resampling step, creating
additional nodes in the vicinity of nodes that are connected with only a few others,
is shown in [8]. Nodes close to the surface of the obstacles are added in [2]. A
dilation of the configuration space has been suggested in [6], as well as an
in-depth analysis of the narrow passage problem. In [16] a procedure for retracting
configurations onto the free space medial axis is presented. In [5] a probabilistic
method for choosing configurations close to the obstacles is presented.



3 Preliminaries and Notation 

The moving objects (robots) considered in this paper are rigid objects in
three-dimensional space. We represent configurations using six-tuples (x, y, z, α, β, γ),
where the first three coordinates define the position and the last three define
the orientation. The orientation coordinates are represented in radians.

In addition to collision detection, all PRMs make heavy use of so-called local
planners and distance computation. Local planners are simple, fast, deterministic
methods used to make connections between roadmap nodes when building the
roadmap, and to connect the start and goal configurations to the roadmap during
queries. Distance metrics are used to determine which pairs of nodes one should
try to connect.

Let B = {B_0, B_1, ...} be a set of obstacles in the workspace W. Let r_i be
the radius of the sphere which encloses each body in the workspace
(including the robot).

Let d_i be the distance between two configurations, one of them associated with
the robot and the other with the obstacle B_i.

Let c(B_i) be the random configuration computed by the heuristic using the
obstacle B_i. Let q be the configuration c(B_i) added to the roadmap, and let the
roadmap be denoted by R. Let v_i be the direction vector which defines the
“straightness” feature for each B_i ∈ B, and let v_ri denote the same feature on
the robot. This feature indicates the direction along which the body presents
its long side.

The symbol || will be used to show that two configurations c(B_i) and c(B_j)
are parallel, meaning that the bodies associated with the two configurations keep
their direction vectors parallel.



4 Algorithm Description 

We now describe the planner in general terms for a specific type of robot, the
free flying objects. During the preprocessing phase a data structure called the
roadmap is constructed in a probabilistic way for a given scene. The roadmap is
an undirected graph R = (N, E). The nodes in N are a set of configurations of
the robot appropriately chosen over the free C-space (C_free). The edges in E
correspond to (simple) paths; an edge between two nodes corresponds to a feasible
path connecting the relevant configurations. These paths are computed by an
extremely fast, though not very powerful, planner called the local planner. During
the query phase, the roadmap is used to solve individual path planning
problems in the input scene. Given a start configuration q_init and a goal configuration
q_goal, the method first tries to connect q_init and q_goal to some two nodes q'_init and
q'_goal. If successful, it then searches R for a sequence of edges in E
connecting q'_init to q'_goal. Finally, it transforms this sequence into a feasible path for the
robot by recomputing the corresponding local paths and concatenating them.
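The roadmap R = (N, E) described above can be sketched as a list of nodes plus an adjacency map. The Python code below is a generic PRM skeleton on a 2D toy scene, not the authors' C++ implementation: `build_roadmap`, the `k_neighbors` limit, the disc obstacle and the sampled straight-line local planner are all assumptions chosen to keep the example small.

```python
import math
import random


def build_roadmap(sample, is_free, local_planner, distance, n_nodes, k_neighbors=5):
    """Return (nodes, edges) for an undirected roadmap.

    nodes is a list of configurations; edges maps a node index to the set of
    indices it is connected to. Assumes free configurations can be sampled,
    otherwise the loop would not terminate.
    """
    nodes, edges = [], {}
    while len(nodes) < n_nodes:
        q = sample()
        if not is_free(q):
            continue
        idx = len(nodes)
        nodes.append(q)
        edges[idx] = set()
        # Try to connect q to its nearest existing nodes first.
        nearest = sorted(range(idx), key=lambda i: distance(q, nodes[i]))
        for i in nearest[:k_neighbors]:
            if local_planner(q, nodes[i]):
                edges[idx].add(i)
                edges[i].add(idx)
    return nodes, edges


# Toy scene: unit square with one disc obstacle (illustrative only).
rng = random.Random(0)


def sample():
    return (rng.random(), rng.random())


def is_free(q):
    return math.dist(q, (0.5, 0.5)) > 0.2


def straight_line_free(a, b, steps=20):
    """Very simple local planner: check evenly spaced points on segment a-b."""
    return all(
        is_free((a[0] + (b[0] - a[0]) * t / steps, a[1] + (b[1] - a[1]) * t / steps))
        for t in range(steps + 1)
    )


nodes, edges = build_roadmap(sample, is_free, straight_line_free, math.dist, n_nodes=30)
```

Storing edges as a symmetric adjacency map keeps the later graph search simple; the local paths themselves are not stored and would be recomputed at query time, as the text describes.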



4.1 The Construction Phase 

The construction phase consists of two successive steps: the first approximation of
the roadmap and the expansion of the roadmap. The objective of the first
approximation is to obtain an approximate roadmap using an inexpensive and fast
process. The expansion step is aimed at further improving the
connectivity of this graph. It selects nodes of R which, according to some heuristic
evaluator, lie in difficult regions of C_free and expands the graph by generating
additional nodes in their neighborhoods.

First Approximation of the Roadmap. Initially the graph R = (N, E) is
empty. Then, repeatedly, a random free configuration is generated and added to
N. For every such new node q, we select a number of nodes from the current
N and try to connect q to each of them using the local planner. Whenever this
planner succeeds in computing a feasible path between q and a selected node q', the
edge (q, q') is added to E. To make our presentation more precise:

Let LP be the local planner, which returns a value in {0,1} depending on whether
it can compute a path between the two configurations given as arguments.

Let D be the distance function, defining a pseudo-metric in C-space.

The roadmap first approximation algorithm can be outlined as follows:

First Approximation

1. N ← ∅

2. for k = 1 to CTE_NODES

3.     q ← a randomly chosen free configuration

4.     if (q is not in collision) then

5.         N ← N ∪ {q}

6.         k = k + 1

7. endfor

Some choices for the steps of the above algorithm are still unspecified. In
particular, we need to define how random configurations are created in step 3, and
which function is used for collision detection in step 4.

Creation of random configurations. The nodes of R should constitute a rather
uniform sampling of C_free. Every such configuration is obtained by drawing each
of its coordinates from the interval of allowed values of the corresponding dof
using a uniform probability distribution over this interval. This sampling is computed
using spheres which surround the objects. Figure 1 shows this sampling.

Collision detection. During this process only, the collision detection function is
implemented using spheres: we enclose the robot and each of the
obstacles in a sphere, and the collision check is reduced to testing whether two
spheres intersect. Therefore, this check can be computed quickly.
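The two unspecified steps can be illustrated with a short Python sketch: uniform sampling of a six-tuple configuration and a bounding-sphere collision test. The function names, the `bounds` format and the obstacle representation as (center, radius) pairs are assumptions made for this example.

```python
import math
import random


def sample_configuration(bounds, rng):
    """Draw each of the six coordinates uniformly from its allowed interval.

    bounds is a sequence of six (low, high) pairs: x, y, z and three angles.
    """
    return tuple(rng.uniform(low, high) for low, high in bounds)


def spheres_collide(center_a, radius_a, center_b, radius_b):
    """Two spheres intersect when their centers are closer than the radii sum."""
    return math.dist(center_a, center_b) <= radius_a + radius_b


def is_free_by_spheres(q, robot_radius, obstacles):
    """Coarse test: the robot's bounding sphere touches no obstacle sphere.

    obstacles is a list of (center, radius) pairs (assumed representation).
    """
    position = q[:3]
    return not any(
        spheres_collide(position, robot_radius, center, radius)
        for center, radius in obstacles
    )


rng = random.Random(1)
bounds = [(0.0, 10.0)] * 3 + [(0.0, 2.0 * math.pi)] * 3
q = sample_configuration(bounds, rng)
obstacles = [((5.0, 5.0, 5.0), 1.0)]
```

The sphere test ignores orientation, which is why it is fast but conservative: a configuration rejected here might still be collision-free for the exact geometry.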



Expanding the Roadmap. If the number of nodes computed during the first
approximation of the roadmap is large enough, the set N gives a fairly uniform
covering of C_free. In easy scenes R is well connected. But in more constrained
ones where C_free is actually connected, R often consists of a few large
components and several small ones. It therefore does not effectively capture the
connectivity of C_free.

Fig. 1. First approximation using spheres surrounding the robot and computing a
reduced number of configurations.

Fig. 2. The rotation axis is defined in the same direction as the straightness
feature.

The purpose of the expansion is to add more nodes in a way that will
facilitate the formation of large components comprising as many of the nodes
as possible and will also help cover the more difficult narrow parts of C_free.
The identification of these difficult parts of C_free is no simple matter and the
heuristic that we propose below goes only a certain distance in this direction.

First, the heuristic attempts to take advantage of a geometric feature of the
robot and the obstacles to obtain information that allows it to guide the search for
useful configurations and to reduce the number of invalid configurations that
must be calculated. To improve the performance of these planners, we aim to
obtain a better representation of the connectivity of the configuration space.

The first feature used is called “straightness”. This feature indicates the
direction along which the body presents its long side and it is given by a vector v_i,
which we have used as the direction of the rotation axis. Figure 2 shows how this
feature is used to define the rotation axis. Next, we calculate a random
configuration c(B_i) near each obstacle B_i. If c(B_i) is in collision, then the
heuristic attempts to move this configuration to turn it into a free configuration,
which will be close to some obstacle. (Amato and Wu [3] propose an algorithm
which computes configurations on C-obstacle surfaces.)

Parallel Configurations. Once the colliding configuration c(B_i) has been
found, the first strategy is to rotate it until the rotation axis v_ri of the robot
is parallel to the rotation axis v_i of the obstacle B_i, that is, v_ri || v_i. The
configuration obtained can be seen in Figure 3, which shows different
parallel configurations. If the new c(B_i) is not in collision then it is added to the
roadmap; if it is in collision, a process called the elastic band is applied to c(B_i).

Perpendicular Configurations. The second strategy is to rotate it until the
rotation axis v_ri of the robot is perpendicular to the rotation axis v_i of the
obstacle B_i, that is, v_ri ⊥ v_i. The configuration computed can be seen
in Figure 3. If the new c(B_i) is not in collision then it is added to the roadmap;
if it is in collision, the elastic band process is applied to c(B_i).
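The parallel and perpendicular conditions on the two axes can be checked with a cross product and a dot product. The Python sketch below also returns an axis-angle rotation that would bring one direction onto the other; all names and the tolerance are hypothetical, and the paper does not state how its implementation performs the rotation.

```python
import math


def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def _cross(u, v):
    return (
        u[1] * v[2] - u[2] * v[1],
        u[2] * v[0] - u[0] * v[2],
        u[0] * v[1] - u[1] * v[0],
    )


def _norm(u):
    return math.sqrt(_dot(u, u))


def is_parallel(u, v, tol=1e-9):
    """Axes are parallel when their cross product vanishes."""
    return _norm(_cross(u, v)) <= tol * _norm(u) * _norm(v)


def is_perpendicular(u, v, tol=1e-9):
    """Axes are perpendicular when their dot product vanishes."""
    return abs(_dot(u, v)) <= tol * _norm(u) * _norm(v)


def rotation_to_align(u, v):
    """Return (axis, angle) of a rotation taking the direction of u onto v.

    axis is None when u and v are already parallel or anti-parallel.
    """
    c = _cross(u, v)
    n = _norm(c)
    angle = math.atan2(n, _dot(u, v))
    if n == 0.0:
        return None, angle
    return tuple(x / n for x in c), angle


axis, angle = rotation_to_align((1.0, 0.0, 0.0), (0.0, 1.0, 0.0))
```

Because an axis has no preferred sign, anti-parallel vectors are treated as parallel here.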



The Elastic Band Heuristic. This process works as follows. First it calculates
the distance vector d_i between the obstacle position and the configuration c(B_i),
and attempts to move the robot toward and away from the obstacle. While the
band is working (scaling the d_i vector to compute the next position where
c(B_i) is going to be placed) the robot is rotated about its rotation axis, searching
for a free configuration.

The first criterion tries to move and rotate the configuration in a smooth way,
seeking to keep the rotation axis of the robot parallel to the rotation axis of the
obstacle where the collision occurred. The technique works like the elastic band
described above. While the elastic band process is computed, the
robot is rotated about v_ri; the heuristic assumes that, if the “straightness” feature
is defined as in the notation section, then rotating the robot about
v_ri gives the smallest swept area. Figure 4 shows how the algorithm
works, searching for configurations around the obstacle and close to
it. In the figure, v_i and v_ri represent the rotation axes of the obstacle and
the robot respectively, and d_i represents the distance vector which the heuristic
scales to reach a free configuration. The following algorithm presents the elastic band
process.

Elastic Band Heuristic

1. r_i = radius of i-obstacle

2. d_init = position of the i-obstacle

3. scale_value = 0, angle = 0, k = 0;

4. d_i = distance vector between i-obstacle and robot positions

5. do

       scale_value = random between (0.5 and 1.5)
       scale(d_i, scale_value)
       angle = random between (0, 2π)
       rotate_robot_on_rotation_axis(angle)
       c(B_i) = get_Configuration_with_position(d_i)
       k = k + 1

6. while (c(B_i) is not free configuration and k < CTE)

7. add_Configuration c(B_i) to the roadmap

The second criterion is to obtain a configuration perpendicular to the obstacle;
this option can be seen in Figure 3. As with the parallel configuration,
the perpendicular configuration is added to the roadmap if and only if it is not
in collision; otherwise the elastic band process is applied to it.

In the elastic band heuristic presented above, two items remain
undefined. The first is the scale_value parameter, which stores the value
by which the distance vector will be scaled to obtain the next position of the
configuration c(B_i). The second is the scale function, which computes the product
of the vector d_i and the scale_value.

Fig. 3. Parallel and perpendicular configurations around the obstacle

Fig. 4. The elastic band heuristic
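A minimal Python sketch of the elastic band loop follows. It rests on several assumptions that the paper leaves open: the distance vector is scaled cumulatively on each pass, the rotation is reduced to a single angle about the robot's axis, `is_free` is a caller-supplied predicate, and the function returns None instead of adding a configuration when the attempt budget runs out.

```python
import math
import random


def elastic_band(obstacle_pos, robot_pos, is_free, max_tries, rng):
    """Search for a free (position, angle) pair near an obstacle.

    obstacle_pos, robot_pos: 3D points. is_free(position, angle) -> bool.
    rng: object with a uniform(a, b) method, e.g. random.Random.
    Returns (position, angle), or None if no free configuration was found.
    """
    # Distance vector from the obstacle to the robot.
    d = [r - o for r, o in zip(robot_pos, obstacle_pos)]
    for _ in range(max_tries):
        scale_value = rng.uniform(0.5, 1.5)
        d = [scale_value * x for x in d]  # stretch or shrink the band
        angle = rng.uniform(0.0, 2.0 * math.pi)  # rotation about the robot axis
        position = tuple(o + x for o, x in zip(obstacle_pos, d))
        if is_free(position, angle):
            return position, angle
    return None


# Toy use: the obstacle is a unit sphere at the origin, orientation is ignored.
obstacle = (0.0, 0.0, 0.0)
candidate = elastic_band(
    obstacle,
    (0.8, 0.0, 0.0),
    lambda position, angle: math.dist(position, obstacle) > 1.0,
    max_tries=1000,
    rng=random.Random(3),
)
```

A scale factor below 1 pulls the robot toward the obstacle and a factor above 1 pushes it away, which is what gives the search its band-like behaviour.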

4.2 The Query Phase 

During the query phase, paths are to be found between arbitrary input start
and goal configurations q_init and q_goal, using the roadmap computed in the
construction phase. Assume for the moment that C_free is connected and that
the roadmap consists of a single connected component R. We try to connect
q_init and q_goal to some two nodes of R, respectively q'_init and q'_goal, with feasible
paths P_init and P_goal. If this fails, the query fails. Otherwise, we compute a path
P in R connecting q'_init to q'_goal. A feasible path from q_init to q_goal is eventually
constructed by concatenating P_init, the recomputed path corresponding to P,
and P_goal reversed.

The main question is how to compute the paths P_init and P_goal. The queries
should preferably terminate quasi-instantaneously, so no expensive algorithm is
desired here. Our strategy for connecting q_init to R is to consider the nodes in R
in order of increasing distance from q_init and try to connect q_init to each of them
with the local planner, until one connection succeeds or until an allocated
amount of time has elapsed.
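The query procedure can be sketched as follows in Python. The roadmap format (a node list plus an adjacency map), the function names and the use of an attempt limit in place of the time limit mentioned above are assumptions made for this illustration; breadth-first search stands in for whatever graph search the implementation uses.

```python
import math
from collections import deque


def connect(q, nodes, local_planner, distance, max_attempts):
    """Try roadmap nodes in order of increasing distance from q."""
    order = sorted(range(len(nodes)), key=lambda i: distance(q, nodes[i]))
    for i in order[:max_attempts]:
        if local_planner(q, nodes[i]):
            return i
    return None


def graph_search(edges, start, goal):
    """Breadth-first search; returns a list of node indices or None."""
    parent = {start: None}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        if u == goal:
            path = []
            while u is not None:
                path.append(u)
                u = parent[u]
            return path[::-1]
        for v in edges[u]:
            if v not in parent:
                parent[v] = u
                queue.append(v)
    return None


def query(q_init, q_goal, nodes, edges, local_planner, distance, max_attempts=10):
    """Return a list of configurations from q_init to q_goal, or None."""
    a = connect(q_init, nodes, local_planner, distance, max_attempts)
    b = connect(q_goal, nodes, local_planner, distance, max_attempts)
    if a is None or b is None:
        return None
    index_path = graph_search(edges, a, b)
    if index_path is None:
        return None
    return [q_init] + [nodes[i] for i in index_path] + [q_goal]


# Toy roadmap: three nodes on a line, no obstacles.
nodes = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
edges = {0: {1}, 1: {0, 2}, 2: {1}}
path = query((-0.1, 0.0), (2.1, 0.0), nodes, edges, lambda a, b: True, math.dist)
```

The returned list names the configurations to visit; the local paths between consecutive configurations would still have to be recomputed with the local planner.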

5 Experimental Results 

We demonstrate the application of the planner with four examples. The
planner is implemented in C++ and works in three-dimensional space.
For the experiments reported here we used an Intel Pentium 4 CPU at 2.4 GHz with
512 MB of RAM.

In the following, we analyze the performance of the method (measured by
its ability to solve the problems) on several scenes.

Fig. 5. Scene 1. The robot is a tetrahedron, and the obstacles are four prisms.

Fig. 6. Scene 2. The robot is a large tetrahedron, and the obstacles are five
prisms.

Fig. 7. Scene 3. The robot is a cube passing through a corridor.

Fig. 8. Scene 4. The robot has a more complicated form.



In all cases we used a free-flying object robot with six degrees of freedom.
The various environments, and some representative configurations of the robot,
are shown in Figures 5, 6, 7 and 8. Note that the roadmap size is influenced by the
number of obstacles in the workspace, since a set of roadmap nodes is generated
for each obstacle; that is, the size of the network is related to the complexity of the
environment. The four scenes shown are the result of applying the technique
to four problems with different levels of difficulty, labeled E1, E2, E3 and E4.



E1: This environment contains four obstacles and the robot is represented by
a tetrahedron. The roadmap for this environment is simple, because the corridor
is not very narrow; Figure 5 shows that the method obtains a
path between the initial and goal configurations.

E2: This scene contains five obstacles, and the robot is represented
as a large tetrahedron. Here the robot is somewhat
large relative to the obstacles; nevertheless, the algorithm is able to compute a
path between the initial and goal configurations (see Figure 6).

E3: This environment contains four obstacles and the robot is represented by
a cube. The obstacles are placed in such a way that they form a narrow corridor.
This roadmap is not easy to calculate, because the robot is large and
configurations become difficult to calculate. Figure 7 shows that
the solution obtained contains many configurations near the obstacles.

E4: This scene contains two obstacles, and the robot has a more complex
shape. There is a narrow corridor which is difficult
to traverse; nevertheless, the heuristic is able to find a path through the
corridor (see Figure 8).



6 Conclusions 

We have described a new randomized roadmap method for
collision-free path planning. To test the concept, we implemented the method
for “free flying object” robots in three-dimensional space.
In the four test scenes, the method was able to find a path. We think that the
heuristic could be modified so that it can be used to solve motion planning
problems for articulated robots. Currently, we are working on the free flying
object problem, and we are adding the “flatness” feature to the heuristic in
order to improve the connectivity. The main idea is to take advantage of the
geometric features of the workspace so that we can obtain a representative
connection of the configuration space. We would like to test our technique on
the so-called “alpha puzzle” problem.



Acknowledgments. We would like to thank the Center of Research in In- 
formation and Automation Technologies (CENTIA) at the Universidad de las 
Americas - Puebla. 

This research is supported by the project: Access to High Quality Digital 
Services and Information for Large Communities of Users, CONACYT. 

References 

1. J.M. Ahuactzin and K. Gupta. A motion planning based approach for inverse
kinematics of redundant robots: The kinematic roadmap. In Proc. IEEE Internat.
Conf. Robot. Autom., pages 3609-3614, 1997.

2. N. Amato, B. Bayazit, L. Dale, C. Jones, and D. Vallejo. OBPRM: An obstacle-based
PRM for 3D workspaces. In P.K. Agarwal, L. Kavraki, and M. Mason, editors,
Robotics: The Algorithmic Perspective. AK Peters, 1998.

3. N. M. Amato and Y. Wu. A randomized roadmap method for path and
manipulation planning. In Proc. IEEE Internat. Conf. Robot. Autom., pages 113-120,
Minneapolis, MN, April 1996.

4. J. Barraquand and J.C. Latombe. Robot motion planning: A distributed
representation approach. Internat. J. Robot. Res., 10(6):628-649, 1991.

5. V. Boor, M. Overmars, and F. van der Stappen. The Gaussian sampling strategy
for probabilistic roadmap planners. In Proc. IEEE Int. Conf. on Rob. and Autom.,
pages 1018-1023, 1999.

6. D. Halperin, L. E. Kavraki, and J.-C. Latombe. Robotics. In J. Goodman and J.
O’Rourke, editors, Discrete and Computational Geometry, pages 755-778. CRC
Press, NY, 1997.

7. D. Hsu, J.C. Latombe, and R. Motwani. Path planning in expansive configuration
spaces. In Proc. IEEE Internat. Conf. Robot. Autom., pages 2719-2726, 1997.

8. L. Kavraki and J. C. Latombe. Probabilistic roadmaps for robot path planning.
In K. G. and A. P. del Pobil, editors, Practical Motion Planning in Robotics:
Current Approaches and Future Challenges, pages 33-53. John Wiley, West Sussex,
England, 1998.

9. L. Kavraki and J. C. Latombe. Randomized preprocessing of configuration space
for fast path planning. In Proc. IEEE Internat. Conf. Robot. Autom., pages
2138-2145, 1994.

10. L. Kavraki, P. Svestka, J.C. Latombe, and M. Overmars. Probabilistic roadmaps for
path planning in high-dimensional configuration spaces. IEEE Trans. Robot.
Autom., 12(4):566-580, August 1996.

11. J-C. Latombe. Robot Motion Planning. Kluwer Academic Publishers, Boston, MA,
1991.

12. T. Lozano-Perez. Spatial planning: a configuration space approach. IEEE Tr. On
Computers, 32:108-120, 1983.

13. M. Overmars. A Random Approach to Motion Planning, Technical Report
RUU-CS-92-32, Computer Science, Utrecht University, the Netherlands, 1992.

14. M. Overmars and P. Svestka. A probabilistic learning approach to motion planning.
In Proc. Workshop on Algorithmic Foundations of Robotics, pages 19-37, 1994.

15. D. Vallejo, I. Remmler, N. Amato. An Adaptive Framework for “Single Shot”
Motion Planning: A Self-Tuning System for Rigid and Articulated Robots. In Proc.
IEEE Int. Conf. Robot. Autom. (ICRA), 2001.

16. S. A. Wilmarth, N. M. Amato, and P. F. Stiller. MAPRM: A probabilistic roadmap
planner with sampling on the medial axis of the free space. In Proc. IEEE Int.
Conf. Robot. and Autom., Detroit, MI, 1999.




Comparative Evaluation of Temporal Nodes 
Bayesian Networks and Networks of 
Probabilistic Events in Discrete Time 



S.F. Galan¹, G. Arroyo-Figueroa², F.J. Diez¹, and L.E. Sucar³

¹ Departamento de Inteligencia Artificial, UNED, Madrid, Spain
{seve,fjdiez}@dia.uned.es

² Instituto de Investigaciones Electricas, Cuernavaca, Mexico
garroyo@iie.org.mx
³ ITESM - Campus Cuernavaca, Mexico



Abstract. Temporal Nodes Bayesian Networks (TNBNs) and Networks 
of Probabilistic Events in Discrete Time (NPEDTs) are two different 
types of Bayesian networks (BNs) for temporal reasoning. Arroyo-Figueroa
and Sucar applied TNBNs to an industrial domain: the diagnosis and
prediction of the temporal faults that may occur in the steam generator 
of a fossil power plant. We have recently developed an NPEDT for the 
same domain. In this paper, we present a comparative evaluation of these 
two systems. The results show that, in this domain, NPEDTs perform 
better than TNBNs. The ultimate reason for that seems to be the finer 
time granularity used in the NPEDT with respect to that of the TNBN. 
Since families of nodes in a TNBN interact through the general model, 
only a small number of states can be defined for each node; this limitation 
is overcome in an NPEDT through the use of temporal noisy gates. 



1 Introduction 

Bayesian networks (BNs) [7] have been successfully applied to the modeling of
problems involving uncertain knowledge. A BN is an acyclic directed graph whose
nodes represent random variables and whose links define probabilistic
dependencies between variables. These relations are quantified by associating a conditional
probability table (CPT) with each node. A CPT defines the probability of a node
given each possible configuration of its parents. BNs specify dependence and
independence relations in a natural way through the network topology. Diagnosis
or prediction with BNs consists in fixing the values of the observed variables and
computing the posterior probabilities of some of the unobserved variables.
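To make the role of a CPT concrete, the following Python sketch computes a posterior in a two-node network Fault → Alarm by Bayes' rule. The variable names and all probabilities are invented for illustration and are unrelated to the power plant domain studied in this paper.

```python
def posterior(prior, cpt, observed):
    """Posterior over a parent variable given one observed child value.

    prior: {parent_state: P(parent_state)}
    cpt:   {parent_state: {child_value: P(child_value | parent_state)}}
    """
    joint = {state: prior[state] * cpt[state][observed] for state in prior}
    total = sum(joint.values())
    return {state: p / total for state, p in joint.items()}


# Invented numbers: a rare fault and a fairly reliable alarm.
prior = {"fault": 0.01, "ok": 0.99}
cpt = {
    "fault": {"alarm": 0.90, "quiet": 0.10},
    "ok": {"alarm": 0.05, "quiet": 0.95},
}
belief = posterior(prior, cpt, "alarm")
```

With these numbers the alarm raises the probability of a fault from 0.01 to about 0.15, which shows how fixing an observed variable updates the belief in an unobserved one.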

Temporal Nodes Bayesian Networks (TNBNs) [1] and Networks of Probabilistic
Events in Discrete Time (NPEDTs) [5] are two different types of BNs for
temporal reasoning, both of them adequate for the diagnosis and prediction of
temporal faults occurring in dynamic processes. Nevertheless, the usual method
of applying BNs to the modeling of temporal processes is based on the use of
Dynamic Bayesian Networks (DBNs) [3,6]. In a DBN, time is discretized and
an instance of each random variable is created for each point in time. While in
a DBN the value of a variable V_i represents the state of a real-world property
at time t_i, in either a TNBN or an NPEDT each value of a variable represents
the time at which a certain event may occur. Therefore, TNBNs and NPEDTs
are more appropriate for temporal fault diagnosis, because only one variable is
necessary for representing the occurrence of a fault and, consequently, the
networks involved are much simpler than those obtained by using DBNs (see [5],
Section 4). However, DBNs are more appropriate for monitoring tasks, since they
explicitly represent the state of the system at each moment.



1.1 The Industrial Domain 

Steam generators of fossil power plants are exposed to disturbances that may 
provoke faults. The propagation of these faults is a non-deterministic dynamic 
process whose modeling requires representing both uncertainty and time. 

We are interested in studying the disturbances produced in the drum level 
control system of a fossil power plant. The drum provides steam to the super- 
heater and water to the water wall of a steam generator. The drum is a tank 
with a steam valve at the top, a feedwater valve at the bottom, and a feedwater 
pump which provides water to the drum. There are four potential disturbances 
that may occur in the drum level control system: a power load increase (LI),
a feedwater pump failure (FWPF), a feedwater valve failure (FWVF), and a
spray water valve failure (SWVF). These disturbances may provoke the events
shown in Figure 1. In this domain, we consider that an “event” occurs when a 
signal exceeds its specified limit of normal functioning. 

Arroyo-Figueroa and Sucar applied TNBNs to the diagnosis and prediction 
of the temporal faults that may occur in the steam generator of a fossil power 
plant [2]. In this work, we describe the development of an NPEDT for the same 
domain. We also present a comparative evaluation of both networks. 

This paper is organized as follows. Sections 2 and 3 give some details re- 
garding the application to our industrial domain of TNBNs and NPEDTs, re- 
spectively. Section 4 explains the process of selection of the evaluation method 
for the domain and presents the results obtained from it for the two systems 
considered. Finally, Section 5 summarizes the main achievements of this work. 



2 TNBN for This Industrial Domain 

Arroyo-Figueroa and Sucar developed a formalism called Temporal Nodes Bayes- 
ian Networks (TNBNs) [1] that combines BNs and time. They applied this for- 
malism to fault diagnosis and prediction for the steam generator of a fossil power 
plant [2]. A TNBN is an extension of a standard BN, in which each node rep- 
resents a temporal event or change of state of a variable. There is at most one 
state change for each variable in the temporal range of interest. The value taken 
on by the variable represents the interval in which the change occurred. Time 
is discretized in a finite number of intervals, allowing a different number and
duration of intervals for each node (multiple granularity). Each interval defined
for a child node represents the possible delays between the occurrence of one
of its parent events and the corresponding change of state of the child node.
Therefore, this model makes use of relative time in the definition of the values
associated with each temporal node with parents.

Fig. 1. Possible disturbances in a steam generator

There is an asymmetry in the way evidence is introduced in the network: The 
occurrence of an event associated to a node without parents constitutes direct 
evidence, while evidence about a node with parents is analyzed by considering 
several scenarios. When an initial event is detected, its time of occurrence fixes 
the network temporally. 

The causal model used by Arroyo-Figueroa and Sucar in the construction of 
their model is shown in Figure 1. The parameters of the TNBN were estimated 
from data generated by a simulator of a 350 MW fossil power plant. A total of 
more than nine hundred cases were simulated. Approximately 85% of the data
were used to estimate the parameters, while the remaining 15% were used in
the evaluation of the model, which we discuss later in the paper.

TNBNs use the general model of causal interaction, but lack a formalization 
of canonical models for temporal processes. Furthermore, each value defined for
an effect node, which is associated with a given time interval, means that
the effect has been caused during that interval by only one of its parent events.
However, this does not hold in domains where evidence about the
occurrence of an event can be explained by several of its causes.

3 NPEDT for This Industrial Domain

In an NPEDT [5], each variable represents an event that can take place at most 
once. Time is discretized by adopting the appropriate temporal unit for each 
case (seconds, minutes, etc.); therefore, the temporal granularity depends on the 
particular problem. The value taken on by the variable indicates the absolute 
time at which the event occurs. 

Formally speaking, a temporal random variable $V$ in the network can take
on a set of values $v[i]$, $i \in \{a, \ldots, b, \mathit{never}\}$, where $a$ and $b$ are instants (or
intervals) defining the limits of the temporal range of interest for $V$. The links in
the network represent temporal causal mechanisms between neighboring nodes.
Therefore, each CPT represents the most probable delays between the parent
events and the corresponding child event. For the case of general dynamic interaction
in a family of nodes, giving the CPT involves assessing the probability
of occurrence of the child node over time, given any temporal configuration of
the parent events. In a family of $n$ parents $X_1, \ldots, X_n$ and one child $Y$, the
CPT is given by $P(y[t_Y] \mid x_1[t_1], \ldots, x_n[t_n])$, with $t_Y \in \{0, \ldots, n_Y, \mathit{never}\}$ and
$t_i \in \{0, \ldots, n_i, \mathit{never}\}$. The joint probability is given by the product of all the
CPTs in the network. Any marginal or conditional probability can be derived
from the joint probability.

If we consider a family of nodes with $n$ parents and divide the temporal
range of interest into $i$ instants, in the general case the CPT associated with the
child node requires $O(i^{n+1})$ independent conditional probabilities. In real-world
applications, it is difficult to find a human expert or a database that allows
us to create such a table, due to the exponential growth of the set of required
parameters with the number of parents. For this reason, temporal canonical
models were developed as an extension of traditional canonical models. In this
fault-diagnosis domain, we only need to consider the temporal noisy OR-gate [5].
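
To make the growth argument above concrete, the following sketch counts parameters for a family with n parents. The exact count for the general CPT follows from the definitions (each variable has i temporal values plus "never"); the count for the noisy OR-gate assumes one time-invariant probability per parent and per delay, which is an assumption of this sketch rather than a statement from the paper.

```python
def general_cpt_size(n_parents, n_instants):
    """Independent probabilities in an unrestricted CPT.

    Each variable takes n_instants temporal values plus 'never'. Every parent
    configuration needs a distribution over the child's values, one of which
    is fixed by normalisation.
    """
    values_per_variable = n_instants + 1
    parent_configurations = values_per_variable ** n_parents
    return parent_configurations * (values_per_variable - 1)


def noisy_or_size(n_parents, n_instants):
    """Assumed count for a time-invariant temporal noisy OR-gate: one
    probability per parent and per possible delay (0 .. n_instants - 1)."""
    return n_parents * n_instants
```

For example, with 2 parents and 36 intervals the general model needs 37 * 37 * 36 entries, whereas the noisy OR-gate under the stated assumption needs 72.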

3.1 Numerical Parameters of the NPEDT 

In our model, we consider a time range of 12 minutes and divide this period into 
20-second intervals. Therefore, there are 36 different intervals in which any event 
in Figure 1 may occur. Given a node $E$, its associated random variable can take
on values $\{e[1], \ldots, e[36], e[\mathit{never}]\}$, where $e[i]$ means that event $E$ takes place
in interval $i$, and $e[\mathit{never}]$ means that $E$ does not occur during the time range
selected. For example, $SWF = swf[3]$ means “spray water flow increase occurred
between seconds 41 and 60”. As the values of any random variable in the network
are exclusive, its associated events can only occur once over time. This condition 
is satisfied in the domain, since the processes involved are irreversible. Without 
the intervention of a human operator, any disturbance could provoke a shutdown 
of the fossil power plant. 
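
The discretisation described above can be sketched as a small mapping from an occurrence time to a value of the temporal variable. The function name and the boundary convention are assumptions chosen so that interval 3 covers seconds 41 to 60, as in the example in the text.

```python
import math

INTERVAL_SECONDS = 20   # interval length used in the paper
N_INTERVALS = 36        # 12 minutes / 20 seconds


def interval_index(t_seconds):
    """Map an occurrence time in seconds to a 1-based interval, or 'never'.

    Assumption: interval i covers (20*(i-1), 20*i], so that interval 3
    spans seconds 41 to 60 as in the example in the text.
    """
    if t_seconds is None or t_seconds > INTERVAL_SECONDS * N_INTERVALS:
        return "never"
    return max(1, math.ceil(t_seconds / INTERVAL_SECONDS))
```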

We use the temporal noisy OR-gate as the model of causal interaction in the 
network. In this model, each cause acts independently of the rest of the causes to 
produce the effect. Independence of causal interaction is satisfied in the domain, 
according to the experts’ opinion. 







S.F. Galan et al. 



Computing the CPT for a node $Y$ in the network requires specifying

$P(y[k] \mid x_i[j_i], x_l[\mathit{never}], l \neq i)$

for each possible delay, $k - j_i$, between cause $X_i$ and $Y$, when the rest of the
causes are absent. Therefore, given that $X_i$ takes place during a certain 20-second
interval, it is necessary to specify the probability of its effect $Y$ taking place in
the same interval (if the rest of the causes are absent), the probability of $Y$
taking place in the next interval, and so on. These parameters were estimated
from the same dataset used by Arroyo-Figueroa and Sucar in the construction 
of the TNBN. 
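
The following sketch shows one way the per-cause delay probabilities described above could be combined into the distribution of the child event. It assumes the usual "earliest cause wins" reading of a noisy OR-gate with independent causes; the exact formulation in [5] may differ, and the function and argument names are hypothetical.

```python
def temporal_noisy_or(parent_intervals, delay_probs, n_intervals):
    """Distribution of the child's occurrence interval under a noisy OR.

    parent_intervals[i]: interval in which cause i occurs, or None (never).
    delay_probs[i][d]: probability that cause i alone produces the effect
        with a delay of d intervals (d = 0 is the same interval).
    Returns {1: p, ..., n_intervals: p, "never": p}.
    """
    def produced_by(i, k):
        # Probability that cause i, acting alone, has produced Y by interval k.
        j = parent_intervals[i]
        if j is None or k < j:
            return 0.0
        return sum(delay_probs[i][: k - j + 1])

    distribution = {}
    previous_cdf = 0.0
    for k in range(1, n_intervals + 1):
        not_yet = 1.0
        for i in range(len(parent_intervals)):
            # Causes act independently, so "not yet produced" probabilities multiply.
            not_yet *= 1.0 - produced_by(i, k)
        cdf = 1.0 - not_yet
        distribution[k] = cdf - previous_cdf
        previous_cdf = cdf
    distribution["never"] = 1.0 - previous_cdf
    return distribution
```

With a single cause present, the result reproduces that cause's delay distribution shifted to its occurrence interval; with several causes, the child tends to occur earlier and "never" becomes less likely.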

In the NPEDT, evidence propagation through exact algorithms takes, in 
general, a few seconds by using Elvira¹ and the factorization described in [4]. (If
that factorization is not used, evidence propagation takes almost one minute.) 
Consequently, this network could be used in a fossil power plant to assist human 
operators in real-time fault diagnosis and prediction. 



4 Evaluation of the TNBN and the NPEDT 

A total of 127 simulation cases were generated for evaluation purposes by means
of a simulator of a 350 MW fossil power plant [8]. Each case consists of a list

$((event_1, t_1), (event_2, t_2), \ldots, (event_{14}, t_{14}))$

where $t_i$ is the occurrence time for $event_i$. There are 14 possible events, as Figure
1 shows. If $event_i$ did not occur, then $t_i = \mathit{never}$. In general, among the 14 pairs
included in each case, some of them correspond to evidence about the state of
the steam generator.

4.1 Selection of the Evaluation Method 

Our first attempt to quantify the performance of each model was carried out as 
follows. For each node or event X not included in the evidence: 

1. Calculate $P^*(X) = P(X \mid e)$, the posterior probability of node $X$, given
evidence $e$.

2. For each simulated case that includes $e$, obtain $ME(P^*(X), t_X)$, a “measure
of error” between the posterior probability and $t_X$, the real (or simulated)
occurrence time for $X$.

3. Calculate the mean and variance of the measures of error obtained in the 
previous step. 



¹ Elvira is a software package for the construction and evaluation of BNs and influence
diagrams, which is publicly available at http://www.ia.uned.es/~elvira
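
Given a probability density function for a variable $V$, $f_V(t)$, if $V$ took place
at $t_V$, a possible measure of error is

$$ME(f_V(t), t_V) = \int_0^{+\infty} f_V(t) \cdot |t - t_V| \, dt. \qquad (1)$$

This measure represents the average time distance between an event occurring
at $t_V$ and another one that follows distribution $f_V(t)$. For example, if $f_V(t)$ is a
constant distribution between $t_i$ and $t_f$ (with $t_i < t_f$) that carries total probability $P$:

$$f_V(t) = \begin{cases} 0 & \text{if } t < t_i \\ \dfrac{P}{t_f - t_i} & \text{if } t_i \le t \le t_f \\ 0 & \text{if } t > t_f \end{cases} \qquad (2)$$

then

$$ME(f_V(t), t_V) = \begin{cases} P \cdot \left(\dfrac{t_i + t_f}{2} - t_V\right) & \text{if } t_V < t_i \\ P \cdot \dfrac{(t_V - t_i)^2 + (t_f - t_V)^2}{2\,(t_f - t_i)} & \text{if } t_i \le t_V \le t_f \\ P \cdot \left(t_V - \dfrac{t_i + t_f}{2}\right) & \text{if } t_V > t_f \end{cases}$$
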

Note that, if time is the variable considered, $ME$ is equivalent to the prediction
error (difference between the observation at time $t$ and the expected forecast
value for time $t$). The probability distribution $f_V(t)$ can be directly obtained
from $P^*(V)$ in an NPEDT, while in a TNBN it is necessary to know which
parent node is really causing $V$, which can be deduced from the information
contained in the corresponding simulated case.
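
As a cross-check of the measure of error for a constant distribution, the sketch below evaluates the closed form obtained by integrating Equation 1 over an interval $[t_i, t_f]$ that carries probability $p$, and compares it with a simple numerical integration. The function names and the numerical rule are illustrative choices, not part of the paper.

```python
def me_uniform(p, t_i, t_f, t_v):
    """Measure of error (Equation 1) for a density equal to p / (t_f - t_i)
    on [t_i, t_f] and zero elsewhere, with the event observed at t_v."""
    midpoint = (t_i + t_f) / 2.0
    if t_v < t_i:
        return p * (midpoint - t_v)
    if t_v > t_f:
        return p * (t_v - midpoint)
    return p * ((t_v - t_i) ** 2 + (t_f - t_v) ** 2) / (2.0 * (t_f - t_i))


def me_numeric(p, t_i, t_f, t_v, steps=100000):
    """Midpoint-rule approximation of the same integral, as a cross-check."""
    width = (t_f - t_i) / steps
    density = p / (t_f - t_i)
    total = 0.0
    for s in range(steps):
        t = t_i + (s + 0.5) * width
        total += density * abs(t - t_v) * width
    return total
```

When the observed time lies outside the interval, the error reduces to the distance between the observed time and the interval midpoint, weighted by $p$.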

Two problems arise when we try to apply Equation 1 to a node of either a 
TNBN or an NPEDT: 



• If, given a simulated case, event $V$ does not occur, we can only assign $t_V$ the
value $+\infty$; as a consequence, the integral in Equation 1 cannot be calculated.

• If $P^*(V = v[\mathit{never}]) > 0$, the value $t$ in Equation 1 cannot be precisely
defined for $V = v[\mathit{never}]$; if we supposed that $t = +\infty$, the integral could
not be computed, as in the previous problem.

In order to avoid these two problems, we adopted an alternative point of view:
Instead of a measure of error, a measure of proximity between $P^*(V)$ and $t_V$
can be used for evaluating the networks. Given a probability density function
for a variable $V$, $f_V(t)$, if $V$ took place at $t_V$, a possible measure of proximity is

$$MP(f_V(t), t_V) = \int_0^{+\infty} \frac{f_V(t)}{1 + \left(\frac{t - t_V}{c}\right)^2} \, dt \qquad (3)$$

where $c$ is an arbitrary constant. We have selected this function because it has
the following desirable properties:

1. As $\int_0^{\infty} f_V(t)\,dt = 1$, $0 \le MP \le 1$. $MP = 1$ when $f_V(t)$ is a Dirac delta
function at $t_V$.

2. Note that when $t = t_V$, the value of the integrand is $f_V(t_V)$; however, as
$|t - t_V| \to +\infty$, the integrand approaches 0 regardless of the value of $f_V$.
The two following properties deal with the two problems presented above
regarding the measure of error.

3. If, given a simulated case, event $V$ does not occur ($t_V = +\infty$), the integrand
is zero when $t \neq \mathit{never}$ and we consider that $MP = P^*(V = v[\mathit{never}])$.

4. If $P^*(V = v[\mathit{never}]) > 0$ and $t_V$ takes on a finite value, we consider that the
contribution of $V = v[\mathit{never}]$ to $MP$ is 0.

5. When the density function is constant inside an interval, $MP$ can be easily
calculated.

Since TNBNs and NPEDTs are discrete-time models, we calculate $MP$ (given
by Equation 3) by adding the contributions of each interval associated with the
values of node $V$ and the contribution of value $\mathit{never}$. $P^*(V)$ defines a constant
probability distribution over each of the intervals defined for $V$. Given the
constant distribution defined in Equation 2, where $P$ is the probability that
$P^*(V)$ assigns to the interval $[t_i, t_f]$,

$$MP(f_V(t), t_V) = \frac{P \cdot c}{t_f - t_i} \cdot \left[\arctan\frac{t_f - t_V}{c} - \arctan\frac{t_i - t_V}{c}\right].$$

As expected, the maximum measure of proximity appears when $t_V = \frac{t_i + t_f}{2}$. We
have used $c = 360$ in the TNBN and the NPEDT. In the NPEDT $t_f - t_i =$
20 seconds, while in the TNBN $t_f - t_i$ is specific to each interval.
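
The interval-wise computation described above can be sketched as follows. The arctangent expression is the integral of a constant density against the kernel $1 / (1 + ((t - t_V)/c)^2)$; the data layout (a list of interval triples plus a separate probability for "never") and the function names are assumptions made for this illustration.

```python
import math


def mp_interval(p, t_i, t_f, t_v, c=360.0):
    """Contribution to MP of probability p spread uniformly over [t_i, t_f]."""
    return (p * c / (t_f - t_i)) * (
        math.atan((t_f - t_v) / c) - math.atan((t_i - t_v) / c))


def measure_of_proximity(interval_probs, p_never, t_v, c=360.0):
    """MP between a discretised posterior and an observed time.

    interval_probs: list of (t_i, t_f, p) triples, times in seconds.
    p_never: posterior probability of the value 'never'.
    t_v: observed occurrence time in seconds, or None if the event
        did not occur.
    """
    if t_v is None:
        return p_never  # property 3: only the 'never' value counts
    # Property 4: 'never' contributes nothing when t_v is finite.
    return sum(mp_interval(p, t_i, t_f, t_v, c)
               for t_i, t_f, p in interval_probs)
```

For a 20-second interval and $c = 360$, an event observed at the interval midpoint yields a contribution very close to the full probability of that interval, which is why well-placed NPEDT predictions score near 1.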

4.2 Results 

By using the measure of proximity proposed in Section 4.1, we have performed 
tests for prediction and for diagnosis from the 127 simulation cases available. 

Prediction. In order to analyze the predictive capabilities of the networks, we 
have carried out four different types of tests. In each of them there was only 
an initial fault event present: SWVF, FWVF, FWPF and LI, respectively. The 
states of the rest of the nodes in the networks were unknown. The time at which 
the corresponding initial fault event occurred defines the beginning of the global 
time range considered. Among the 127 simulated cases, 64 are associated to the 
presence of LI, and the rest of the initial fault events are simulated by means of 21 
cases each. Tables 1 through 4 contain the means and variances of the measures 
of proximity obtained separately for both the TNBN and the NPEDT in the 
predictive tests. The averages of the values shown in the last row of each table
are: $\mu$(TNBN) = 0.789003, $\sigma^2$(TNBN) = 2.603E-4, $\mu$(NPEDT) = 0.945778, and
$\sigma^2$(NPEDT) = 1.509E-3.
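
The overall figures quoted above can be reproduced as unweighted means of the "Average" rows of Tables 1 through 4. The short check below uses the values printed in those tables; the variable names are illustrative.

```python
# "Average" rows of Tables 1-4:
# (mean TNBN, variance TNBN, mean NPEDT, variance NPEDT)
AVERAGE_ROWS = [
    (0.80266, 8.404e-4, 0.952556, 3.767e-3),    # Table 1, SWVF
    (0.754041, 1.89e-4, 0.902349, 1.998e-3),    # Table 2, FWVF
    (0.772366, 5.205e-6, 0.977528, 1.294e-4),   # Table 3, FWPF
    (0.826946, 6.712e-6, 0.95068, 1.437e-4),    # Table 4, LI
]


def column_means(rows):
    """Unweighted mean of each column."""
    return [sum(column) / len(rows) for column in zip(*rows)]


mu_tnbn, var_tnbn, mu_npedt, var_npedt = column_means(AVERAGE_ROWS)
```

Note that the means are not weighted by the number of cases per fault type (64 for LI and 21 for each of the others).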

The results show that the NPEDT predicts more accurately than the TNBN. 
In general, the difference between the accuracy of the predictions of the two
networks grows as we go down in the graph. Both networks predict correctly
that some events do not occur. Such events have been omitted in the tables. 

Table 1. Means and variances of MP when SWVF is present

Node      μ (TNBN)   σ² (TNBN)   μ (NPEDT)   σ² (NPEDT)
SWV       0.99786    2.185E-6    0.996066    3.456E-6
SWF       0.85793    4.267E-6    0.987375    4.138E-5
STT       0.55219    2.515E-3    0.874228    0.011258
Average   0.80266    8.404E-4    0.952556    3.767E-3


Table 2. Means and variances of MP when FWVF is present

Node      μ (TNBN)   σ² (TNBN)   μ (NPEDT)   σ² (NPEDT)
FWV       0.99844    2.053E-6    0.995963    1.286E-6
FWF       0.88154    8.99E-5     0.957003    2.39E-4
SWF       0.71559    1.069E-6    0.818828    4.704E-4
DRL       0.85165    1.035E-3    0.914457    0.003583
DRP       0.93127    6.317E-6    0.895225    0.006293
STT       0.14576    5.724E-8    0.832621    0.001401
Average   0.754041   1.89E-4     0.902349    1.998E-3



Table 3. Means and variances of MP when FWPF is present

Node      μ (TNBN)   σ² (TNBN)   μ (NPEDT)   σ² (NPEDT)
FWP       0.87001    3.522E-8    0.996962    7.425E-7
FWF       0.90496    1.478E-5    0.989113    2.018E-5
SWF       0.89533    1.537E-5    0.972282    1.598E-4
DRL       0.88665    5.776E-8    0.976043    9.658E-5
DRP       0.93262    9.382E-7    0.975946    6.43E-5
STT       0.14463    4.961E-8    0.954822    4.348E-4
Average   0.772366   5.205E-6    0.977528    1.294E-4



Table 4. Means and variances of MP when LI is present

Node      μ (TNBN)   σ² (TNBN)   μ (NPEDT)   σ² (NPEDT)
FWP       0.93043    9.377E-6    0.988255    2.594E-5
STV       0.99858    1.455E-6    0.997886    8.289E-7
FWF       0.83527    1.682E-5    0.97729     2.795E-4
STF       0.99694    8.829E-6    0.992117    1.35E-5
SWF       0.69901    2.722E-6    0.967232    6.314E-4
DRL       0.62306    5.715E-8    0.71745     2.147E-5
DRP       0.99204    1.433E-5    0.978515    1.271E-4
STT       0.540239   1.094E-7    0.986699    5.011E-5
Average   0.826946   6.712E-6    0.95068     1.437E-4

Table 5. Means and variances of MP when STT and one of its parents are present

Node      μ (TNBN)   σ² (TNBN)   μ (NPEDT)   σ² (NPEDT)
SWVF      0.701059   0.041       0.802444    0.118323
FWVF      0.449978   3.147E-3    0.742676    0.071108
FWPF      0.450636   3.064E-3    0.754429    0.062699
LI        0.655584   0.114252    0.995538    3.649E-5
SWV       0.762227   0.01292     0.801968    0.118188
FWV       0.446817   3.518E-3    0.742875    0.070954
FWP       0.428903   0.044449    0.688344    0.041785
STV       0.6529     0.115369    0.995781    1.091E-5
FWF       0.488658   0.04751     0.5607      0.075691
STF       0.630933   0.100964    0.999926    2.671E-9
SWF       0.609587   0.025558    0.999525    2.228E-7
DRL       0.562921   0.049913    0.510202    0.109071
DRP       0.809094   0.117564    0.662385    0.164828
Average   0.588448   0.052248    0.788984    0.064053



Diagnosis. The diagnostic capabilities of the TNBN and the NPEDT were 
studied on one type of test: The final fault event, STT (see Figure 1), was 
considered to be present. The occurrence time for STT established the end of 
the global time range analyzed. In this type of test, all the 127 simulated cases 
were used. Since in a TNBN the introduction of evidence for a node with parents 
requires knowing which of them is causing the appearance of the child event, in 
this type of test it was necessary to consider information from two nodes: STT 
and its causing parent. Table 5 includes the means and variances of the measures 
of proximity obtained in this test. Again, the NPEDT performs better than the 
TNBN. 

Although in general the measures of proximity for diagnosis are lower than 
those for prediction, that does not mean that our BNs perform in diagnosis worse 
than in prediction. There is another reason that explains this result: If we had 

- two different probability density functions, $f_V(t)$ and $f_W(t)$, the former
sparser (more spread out) than the latter, and

- two infinite sets of cases, $C_V$ and $C_W$, following distributions $f_V(t)$ and
$f_W(t)$, respectively,

then, from Equation 3, $MP$ would be lower on average for $V$ than for $W$, since
$|t - t_V|$ is on average greater than $|t - t_W|$. Therefore, although a BN yielded
satisfactory inference results both for variable $V$ and variable $W$, Equation 3
could in general produce different average MPs for V with respect to W. This 
is taking place in our tests. For example, in the NPEDT we calculated the mean 
number of states per node whose posterior probability was greater than 0.001. 
While in the prediction tests this number was approximately 5, in diagnosis it 
rose to nearly 9. Anyhow, the measure of proximity defined in Equation 3 allows 
us to carry out a comparative evaluation of the TNBN and the NPEDT. 
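
The effect of spread on the average measure of proximity can be illustrated numerically. The sketch below assumes a hypothetical posterior that is uniform on $[0, \mathit{width}]$ and true occurrence times that are spread uniformly over the same interval, so the inference is "satisfactory" in both cases and only the spread changes; the closed form for a constant density is the arctangent expression used for Equation 3.

```python
import math


def mp_interval(p, t_i, t_f, t_v, c=360.0):
    """MP contribution of probability p spread uniformly over [t_i, t_f]."""
    return (p * c / (t_f - t_i)) * (
        math.atan((t_f - t_v) / c) - math.atan((t_i - t_v) / c))


def average_mp(width, c=360.0, samples=1000):
    """Average MP when the posterior is uniform on [0, width] and the true
    occurrence times are themselves spread uniformly over that interval."""
    total = 0.0
    for s in range(samples):
        t_v = (s + 0.5) * width / samples
        total += mp_interval(1.0, 0.0, width, t_v, c)
    return total / samples
```

Calling `average_mp` with increasing widths (for example 20, 100 and 180 seconds) gives decreasing averages, even though the posterior matches the distribution of the cases in every run.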





Comparative Evaluation of Temporal Nodes Bayesian Networks



5 Conclusions 

Arroyo-Figueroa and Sucar applied TNBNs to fault diagnosis and prediction 
for the steam generator of a fossil power plant. We have recently developed an 
NPEDT for the same domain and have carried out a comparative evaluation 
of the two networks. Our evaluation method is based on a proximity measure 
between the posterior probabilities obtained from the networks and each of the 
simulated cases available for evaluation. Since the faults that may occur in the 
steam generator of a fossil power plant constitute dynamic processes, the proxim- 
ity measure takes into account time as an important variable. We have performed 
different tests in order to compare the predictive as well as the diagnostic ca- 
pabilities of the TNBN and the NPEDT. The results show that in general the 
NPEDT yields better predictions and diagnoses than the TNBN. There are two 
main reasons for that: Firstly, the use of temporal noisy gates in the NPEDT 
allows for a finer granularity than in the case of the TNBN and, secondly, the 
definition of the intervals in a TNBN is not so systematic as in an NPEDT and 
depends strongly on the domain. 

Abstract. In this paper we present a new approach to the problem of
approximating a function from a training set of I/O points using fuzzy
logic and fuzzy systems. As we will see, this approach provides a
number of advantages compared with other, more limited systems. Among
these advantages, we highlight the considerable reduction in the
number of rules needed to model the underlying function of the set of
data and, from another point of view, the possibility of bringing interpretation
to the rules of the system obtained, using the Taylor Series concept.
This work is reinforced by an algorithm able to obtain the pseudo-optimal
polynomial consequents of the rules. Finally, the performance of
our approach and that of the associated algorithm are shown through a
significant example.

1 Introduction 

The function approximation problem deals with the estimation of an unknown
model from a data set of continuous input/output points; the objective is to
obtain a model that gives the expected output for any new input data.

Fuzzy logic, on the other hand, is one of the three roots of soft computing;
it has been successfully applied to several areas in science and engineering
owing to its many benefits. One key to the paradigm is the simplicity and
understandability of the model, which nevertheless encapsulates complex
relations among variables. The other main characteristic is that the model
can be interpreted, for example through the use of linguistic values that bring
meaning to the variables involved in the problem.

Many authors have dealt with fuzzy logic and fuzzy systems for function
approximation from an input/output data set, using clustering techniques as
well as grid techniques, and have generally obtained good results. Specifically,
the TSK model [7] is better suited to this kind of problem owing to its computational
capability.

Fig. 1. (a) MF distribution used for this example. Target function: $y = (x - 10)^2$. (b)
Original function + model output + linear submodels for each of the three rules using
a TSK model of order 1. The global output of the TSK fuzzy system is
visually indistinguishable from the actual output, but no interpretation can be given to the
three linear sub-models.

The fuzzy inference method proposed by Takagi, Sugeno and Kang, known
as the TSK model in the fuzzy systems field, has been one of the major
issues in both theoretical and practical research on fuzzy modelling and control.
The basic idea is to subdivide the input space into fuzzy regions and to
approximate the system in each subdivision by a simple model.

The main advantage of the TSK model is its representative power: it is capable
of describing a highly complex nonlinear system using a small number of simple
rules. In spite of this, TSK systems suffer from a lack of interpretability,
which should be one of the main advantages of fuzzy systems in general. While
the overall behaviour of the whole TSK fuzzy system gives an idea of what
it does, the sub-models given by each rule in the TSK fuzzy system provide no
interpretable information by themselves [1]. See Fig. 1.

This lack of interpretability may discourage researchers from using
TSK models in problems where the interpretability of the obtained model
and of the corresponding sub-models is a key concern.

Apart from the interpretability issue, the number of rules needed for a working model
is also a key concern. For control problems, grid-based fuzzy systems are preferable
since they cover the whole input space (all the possible operating regions in
which the plant to be controlled can be found during its operation). Nevertheless,
Mamdani fuzzy systems, and even TSK fuzzy systems of order 0 or 1, although
they reach pseudo-optimal solutions, usually need an excessive number of rules
for a moderate number of input variables.

In this paper we propose the use of high-order TSK rules in a grid-based
fuzzy system, which reduces the number of rules while keeping the advantages of the
grid-based approach for control and function approximation. To preserve the
interpretability of the model obtained, we also present a small modification of the
consequents of the high-order TSK rules that makes each of the sub-models
(rules) composing the global system interpretable.

The rest of the paper is organized as follows: Section 2 presents high-order
TSK rules with an algorithm to obtain the optimal coefficients of the rule
consequents. Section 3 provides an introduction to the Taylor Series Expansion,
the concept that provides the key to the interpretability issue, discussed in
Section 4. Finally, Section 5 provides a complete example that demonstrates
the suitability of our approach.

2 High-Order TSK Fuzzy Rules 

The fuzzy inference system proposed by Takagi, Sugeno and Kang, known as the 
TSK model in the fuzzy system literature, provides a powerful tool for modelling 
complex nonlinear systems. Typically, a TSK model consists of IF-THEN rules 
that have the form: 

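$$R^k: \text{IF } x_1 \text{ is } A_1^k \text{ AND } \ldots \text{ AND } x_n \text{ is } A_n^k \text{ THEN } y = a_0^k + a_1^k x_1 + \ldots + a_n^k x_n \qquad (1)$$

where the $A_i^k$ are fuzzy sets characterized by membership functions $A_i^k(x_i)$, the $a_i^k$
are real-valued parameters and the $x_i$ are the input variables.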

A Sugeno approximator comprises a set of TSK fuzzy rules that maps any
input data $x = [x_1, x_2, \ldots, x_n]$ into its desired output $y \in \mathbb{R}$. The output of the
Sugeno approximator for any input vector $x$ is calculated as follows:

$$F(x) = \frac{\sum_{k=1}^{K} \mu_k(x)\, y_k}{\sum_{k=1}^{K} \mu_k(x)} \qquad (2)$$

where $y_k$ is the output of the consequent of rule $k$ and $\mu_k(x)$ is the activation
value of the antecedent of rule $k$, which can be expressed as:

$$\mu_k(x) = A_1^k(x_1)\, A_2^k(x_2) \cdots A_n^k(x_n) \qquad (3)$$
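
Equations 2 and 3 can be sketched in a few lines of Python. The Gaussian membership functions, the two example rules and all names below are hypothetical choices made for illustration; the paper does not prescribe them at this point.

```python
import math


def gaussian_mf(centre, sigma):
    """Hypothetical Gaussian membership function."""
    return lambda x: math.exp(-((x - centre) ** 2) / (2.0 * sigma ** 2))


def tsk_output(x, rules):
    """Equation 2 for rules given as (membership_functions, coefficients).

    membership_functions: one callable per input variable (Equation 3
        multiplies their values to obtain the rule activation).
    coefficients: [a0, a1, ..., an] of the linear consequent in Equation 1.
    """
    weighted_sum = 0.0
    activation_sum = 0.0
    for membership_functions, coefficients in rules:
        activation = 1.0
        for mf, x_i in zip(membership_functions, x):
            activation *= mf(x_i)
        consequent = coefficients[0] + sum(
            a * x_i for a, x_i in zip(coefficients[1:], x))
        weighted_sum += activation * consequent
        activation_sum += activation
    return weighted_sum / activation_sum


RULES = [
    ([gaussian_mf(0.0, 1.0)], [0.0, 1.0]),    # near 0: y = x
    ([gaussian_mf(4.0, 1.0)], [8.0, -1.0]),   # near 4: y = 8 - x
]
```

Halfway between the two rule centres both rules fire equally, so the output is the plain average of the two consequents; close to a centre, the corresponding rule dominates.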

The main advantage of the TSK model is its representative power; it is capa- 
ble of describing a highly nonlinear system using a small number of rules. More- 
over, since the output of the model has an explicit functional expression form, it 
is conventional to identify its parameters using some learning algorithms. These 
characteristics make the TSK model very suitable for the problem of function
approximation; many authors have successfully applied TSK systems
to function approximation. For example, many well-known neuro-fuzzy systems
such as ANFIS [4] have been constructed on the basis of the TSK model.

Nevertheless, very few authors have dealt with high-order TSK fuzzy systems.
Buckley [5] generalized the original Sugeno inference engine by changing the form
of the consequent to a general polynomial, that is:

$$R^k: \text{IF } x_1 \text{ is } A_1^k \text{ AND } \ldots \text{ AND } x_n \text{ is } A_n^k \text{ THEN } y = Y_k(x) \qquad (4)$$

where $Y_k(x)$ is a polynomial of any order. Taking order 2, it can be expressed as:

$$Y_k(x) = w_0^k + (w^k)^T x + \frac{1}{2}\, x^T W^k x \qquad (5)$$

where $w_0^k$ is a scalar, $w^k$ is a column vector of coefficients with dimension $n$
(one for each input variable) and $W^k$ is a triangular matrix of dimensions $n \times n$
($W_{ij}^k$ = coefficient for the quadratic factor $x_i \cdot x_j$, $i = 1 \ldots n$, $j = i \ldots n$).

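Now that we have defined how a TSK fuzzy system can be adapted to work
with high-order rules, we show how, given a set $D$ of input/output data and a
configuration of membership functions for the input variables, the
consequents of the rules can be adapted so that the TSK model output optimally
fits the data set $D$. The Least Square Error (LSE) algorithm is used for that purpose.
LSE tries to minimize the error function:

$$J = \sum_{m \in D} \left(y_m - F(x_m)\right)^2 \qquad (6)$$

where $F$ is the output of the TSK fuzzy system as in (2). Setting to 0 the first
derivative (7, 8) of $J$ with respect to each single parameter ($w_0$ and each component
of $w$ and $W$) gives a system of linear equations from which to obtain the optimal
values of the parameters. Writing each consequent as $Y_k(x) = \sum_j w_j^k f_{w_j}(x)$:

$$\frac{\partial J}{\partial w_i^s} = -2 \sum_{m \in D} \left( y_m - \frac{\sum_{k=1}^{K} \mu_k(x_m)\, Y_k(x_m)}{\sum_{k=1}^{K} \mu_k(x_m)} \right) \frac{\mu_s(x_m)\, f_{w_i}(x_m)}{\sum_{k=1}^{K} \mu_k(x_m)} = 0 \qquad (7)$$

$$\sum_{k=1}^{K} \sum_{j} w_j^k \sum_{m \in D} \frac{\mu_k(x_m)\, \mu_s(x_m)\, f_{w_j}(x_m)\, f_{w_i}(x_m)}{\left(\sum_{l=1}^{K} \mu_l(x_m)\right)^2} = \sum_{m \in D} \frac{y_m\, \mu_s(x_m)\, f_{w_i}(x_m)}{\sum_{l=1}^{K} \mu_l(x_m)} \qquad (8)$$

where $w_i^s$ (rule $s$, coefficient $i$) is the coefficient with respect to which we
differentiate in each case ($w_0$ or any component of $W$ or $w$), and $f_{w_i}$ is the partial
derivative of the consequent of the rule with respect to $w_i$, i.e., 1 for the 0-order
coefficient $w_0$, $x_i$ for every first-order coefficient $w_i$, or $x_p \cdot x_j$ for every second-order
coefficient $W_{pj}$ of $W$ (up to the constant factor that multiplies the quadratic term in (5)).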
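
The least-squares step can be sketched for a small case: one input, two rules and order-1 consequents, fitted to the target function of Fig. 1. This is only an illustration under stated assumptions: the Gaussian membership functions, their centres and width are hypothetical, and the linear system is solved with plain Gaussian elimination on the normal equations rather than with the OLS method [6] used in the paper.

```python
import math


def solve_linear_system(a, b):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def design_row(x, centres, sigma):
    """One row of the regression matrix for 1-D input and order-1 consequents:
    normalised activation of each rule times (1, x)."""
    activations = [math.exp(-((x - c) ** 2) / (2.0 * sigma ** 2)) for c in centres]
    total = sum(activations)
    row = []
    for activation in activations:
        row += [activation / total, activation / total * x]
    return row


def fit_consequents(xs, ys, centres, sigma):
    """Solve the normal equations obtained by setting dJ/dw = 0."""
    rows = [design_row(x, centres, sigma) for x in xs]
    size = len(rows[0])
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(size)] for i in range(size)]
    aty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(size)]
    return solve_linear_system(ata, aty), rows


XS = [0.5 * i for i in range(41)]            # 0, 0.5, ..., 20
YS = [(x - 10.0) ** 2 for x in XS]           # target function of Fig. 1
WEIGHTS, ROWS = fit_consequents(XS, YS, centres=[5.0, 15.0], sigma=5.0)
RESIDUALS = [y - sum(w * v for w, v in zip(WEIGHTS, r)) for r, y in zip(ROWS, YS)]
```

At the solution, the residuals are orthogonal to every column of the regression matrix, which is exactly the condition that the derivative of $J$ with respect to each coefficient vanishes. Plain normal equations can become singular when the activation matrix is redundant, which is the situation the OLS method is meant to handle.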

Once we have the system of linear equations, it only remains to obtain the
optimal solution for all the coefficients of every rule. The Orthogonal Least-Squares
(OLS) method [6] guarantees a single optimal solution, obtaining the
values of the significant coefficients while discarding the rest. We thereby avoid
the problems due to the presence of redundancy in the activation matrix.

Having reviewed the type of rules that we are going to
operate with, we now turn to the “lack of interpretability curse” from which TSK
fuzzy systems suffer, as we saw in Section 1. As polynomials are not easily interpretable
as consequents of the rules, we now give the key to the interpretability of
our Taylor-Series based rules.

3 Taylor Series-Based Fuzzy Rules (TSFR) 

Let $f(x)$ be a function defined in an interval with an intermediate point $a$, for
which we know the derivatives of all orders. The first-order polynomial:

$$P_1(x) = f(a) + f'(a)(x - a) \qquad (9)$$

has the same value as $f(x)$ at the point $x = a$ and also the same first-order
derivative at this point. Its graphic representation is a tangent line to the graph
of $f(x)$ at the point $x = a$.

Taking also the second derivative of $f(x)$ at $x = a$, we can build the second-order
polynomial

$$P_2(x) = f(a) + f'(a)(x - a) + \frac{1}{2} f''(a)(x - a)^2 \qquad (10)$$

which has the same value as $f(x)$ at the point $x = a$, and also has the same
values for the first and second derivatives. The graph of this polynomial
will be more similar to that of $f(x)$ at points in the vicinity of $x = a$. We
can therefore expect that if we build a polynomial of $n$th order with the first $n$
derivatives of $f(x)$ at $x = a$, that polynomial will get very close to $f(x)$ in the
neighbourhood of $x = a$.
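
The improvement from the first-order to the second-order polynomial can be checked numerically. The sketch below builds a Taylor polynomial from a list of derivative values; the choice of $f(x) = e^x$ around $a = 0$ is purely illustrative.

```python
import math


def taylor_polynomial(derivatives_at_a, a):
    """Return P_n(x) = sum_k f^(k)(a) / k! * (x - a)^k.

    derivatives_at_a: [f(a), f'(a), f''(a), ...]; its length sets the order.
    """
    def polynomial(x):
        return sum(d / math.factorial(k) * (x - a) ** k
                   for k, d in enumerate(derivatives_at_a))
    return polynomial


# Illustrative choice: f(x) = exp(x) around a = 0, where every derivative is 1.
p1 = taylor_polynomial([1.0, 1.0], 0.0)
p2 = taylor_polynomial([1.0, 1.0, 1.0], 0.0)
```

Evaluating both polynomials at a point near $a$, such as $x = 0.1$, shows that the second-order polynomial is the closer of the two to `math.exp(0.1)`.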

Taylor's theorem states that if a function $f(x)$ defined in an interval has derivatives
of all orders, it can be approximated near a point $x = a$ by its Taylor Series
Expansion around that point:

$$f(x) = f(a) + f'(a)(x - a) + \frac{1}{2} f''(a)(x - a)^2 + \ldots + \frac{1}{n!} f^{(n)}(a)(x - a)^n + \frac{1}{(n+1)!} f^{(n+1)}(c)(x - a)^{n+1} \qquad (11)$$

where in each case, $c$ is a point between $x$ and $a$.

For n-dimensional purposes, the formula is adapted in the following form:

$$f(x) = f(a) + (x - a)^T \left[\frac{\partial f}{\partial x_i}(a)\right]_{i=1 \ldots n} + \frac{1}{2} (x - a)^T W (x - a) + \frac{1}{3!} W^3(x - a, x - a, x - a) + \ldots \qquad (12)$$

where $W$ is a triangular matrix of dimensions $n \times n$, and $W^s$ is a triangular
multi-linear form in $s$ vector arguments $v^1, \ldots, v^s$.

Taylor series open a door to the approximation of any function through
polynomials, that is, through the addition of a number of simple functions. They
are therefore a fundamental tool in the field of Function Approximation Theory
and Mathematical Analysis. Taylor Series Expansion will also provide us with a way
to bring interpretation to TSK fuzzy systems by taking a certain type of rule
consequents and antecedents, as we will now see.

As noted in [3] we will use input variables in the antecedents with membership 
functions that form an Orderly Local Membership Function Basis (OLMF). The 
requirements that a set of membership functions for a variable must fulfil to be 
an OLMF basically are: 

- Every membership function extreme point must coincide with the centre of
the adjacent membership function.

- The n-th derivative of the membership function is continuous in its whole
interval of definition.

- The n-th derivative of the membership function vanishes at the centre and
at the boundaries.

The main advantage of using this kind of membership function is the
differentiability of the output of the TSK fuzzy system. This is not possible
with triangular or trapezoidal membership functions, since their derivative
does not exist at the centres of the membership functions, and the fuzzy
system output is therefore not differentiable.

These OLMF bases also have the addition-to-unity property: the sum of the
activations of all the rules is always equal to unity at any point inside the
input domain in a TSK fuzzy system that keeps the OLMF basis restrictions.
Therefore the output of the TSK fuzzy system can be expressed as:

$$F(x) = \sum_{k=1}^{K} \mu_k(x)\, Y_k \tag{13}$$

Then the OLS method cited in Section 2 will work well for the given system,
and can identify the optimal coefficients without needing another execution of
the algorithm, as noted in [6].

Finally, given that the input variables have a distribution of membership
functions that form an OLMF basis, we will use high-order TSK rules in the form
(4), but where the polynomial consequents are in the form:

$$Y_k(x) = w_0^k + (w^k)^T (x - d_k) + \frac{1}{2}(x - d_k)^T W^k (x - d_k) + \dots \tag{14}$$

where $d_k$ is the centre of rule $k$, therefore forming Taylor Series
Expansions around the centres of the rules.

4 Interpretability Issues 

It can be demonstrated [3] that given a Sugeno approximator $F(x)$ such that:

1) the input variables' membership functions form a set of OLMF bases of
order $m$ (the $m$-th derivative being continuous everywhere);

2) the consequent side is written in the rule-centred form shown in (4) and
(14) and the polynomials $Y_k(x)$ are of degree $n$;

then for $n < m$, every $Y_k(x)$ can be interpreted as a truncated Taylor series
expansion of order $n$ of $F(x)$ about the point $x = d_k$, the centre of the
$k$th rule.

Supposing therefore that we have a method to obtain the optimal coefficients
of the Taylor series-based TSK rule consequents for function approximation,
given a data set and a membership function distribution that forms a set of
OLMF bases, we can then interpret the rule consequents $Y_k(x)$ as the
truncated Taylor series expansions of the system output around the centres of
the rules. This system also provides a pseudo-optimal approximation to the
objective function. In the limit case where the function is perfectly
approximated by our system, the rule consequents will coincide with the Taylor
series expansions of that function about the centre of each rule, reaching
total interpretability and total approximation.

In [3], Marwan Bikdash directly used the (available) Taylor Series Expansion
of the function around the rule centres, for each rule, to approximate the
function with the TSK fuzzy system. These rule consequents, though strongly
interpretable, are not the optimal consequents in the least squares sense. The
Taylor Series Expansion approximates a function only in the vicinity of the
reference point. Therefore, even using a high number of MFs, the error obtained
by the method in [3] is seldom small enough (compared to a system of similar
complexity with consequents optimized using LSE), and the system output barely
represents a good approximation of the data we are modelling.

In this paper we also suppose that the only information we have from the 
function to approximate are the input/output points in the initial dataset. No 
information is given of the derivatives of the function w.r.t. any point. Also, 
there is no accurate way to obtain the derivatives from the training points to 
perform the approximation as the method in [3] required. 

5 Simulations 

Consider a set of 100 randomly chosen I/O data from the 1-D function [2]:

$$F(x) = \sin(2\pi x), \quad x \in [0, 1] \tag{15}$$

Let us now try to model those data using a fuzzy system with 5 membership
functions for the single input variable $x$ forming an OLMF basis and rule
consequents of the form given by (14), $Y_k$ being an order-2 polynomial.
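The setup just described can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' implementation: the squared-cosine membership functions are an assumed stand-in for an OLMF basis (they add to unity and have zero first derivative at the centre and at the boundaries), plain least squares via `numpy.linalg.lstsq` replaces the OLS algorithm, and the random seed is arbitrary. The error measure is the ratio of mean-square error to output variance, under a square root.

```python
import numpy as np

rng = np.random.default_rng(0)           # arbitrary seed (assumption)
x = rng.uniform(0.0, 1.0, 100)           # 100 random inputs in [0, 1]
y = np.sin(2 * np.pi * x)                # target outputs from sin(2*pi*x)

centres = np.linspace(0.0, 1.0, 5)       # five rule centres
h = centres[1] - centres[0]              # spacing between adjacent centres

def memberships(x):
    """Assumed cos^2 membership functions; they sum to one on [0, 1]."""
    d = x[:, None] - centres[None, :]
    mu = np.cos(np.pi * d / (2 * h)) ** 2
    mu[np.abs(d) >= h] = 0.0             # each function is local to (c - h, c + h)
    return mu

def design_matrix(x):
    """Columns mu_k(x) * (x - c_k)^p for p = 0, 1, 2 and every rule k."""
    mu = memberships(x)
    d = x[:, None] - centres[None, :]
    return np.hstack([mu * d ** p for p in range(3)])

# Least-squares estimate of the 15 consequent coefficients.
coeffs, *_ = np.linalg.lstsq(design_matrix(x), y, rcond=None)
residual = y - design_matrix(x) @ coeffs
nrmse = np.sqrt(np.mean(residual ** 2) / np.var(y))
print(f"NRMSE on the training data: {nrmse:.4f}")
```

Because the system output is linear in the consequent coefficients once the membership functions are fixed, a single linear least-squares solve is enough to fit all rules at once.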

The five rules obtained after the execution of the LSE algorithm using OLS
are the following:

IF x is A_1 THEN y = -26.0860x^2 + 5.3247x + 0.0116
IF x is A_2 THEN y = -1.6235(x - 0.25) + 0.2882
IF x is A_3 THEN y = 4.3006(x - 0.5)^2 - 0.5193(x - 0.5)          (16)
IF x is A_4 THEN y = -1.1066(x - 0.75)^2 q.1780(x - 0.75) - 0.0238
IF x is A_5 THEN y = 0

Fig. 2. a) Original function (solid line) and Taylor Series Expansion based
Fuzzy System (dotted line). We see that for only 5 membership functions, the
output of the system is very similar to the original function. NRMSE = 0.0154.
b) Original function + model output + second membership function consequent
(centred at x = 0.25). c) Original function + model output + third membership
function consequent (centred at x = 0.5). d) Original function + model output
+ fourth membership function consequent (centred at x = 0.75). We see clearly
how these polynomials come closer to the Taylor Series Expansion around the
centre of the rules of the fuzzy system output.



The interpretability comes from the fact that, at the points near the centre
of each of the five rules, the function is extremely similar to the polynomial
output of the rule, as shown in Figure 2. These polynomials are kept expressed
as Taylor Series Expansions of the function at the points in the vicinity of
the centres of the rules. The system is therefore fully interpretable and also
brings some further advantages, as noted below.

Figure 2 also shows clearly that the LSE finds the optimal consequent
coefficients for the given input/output data set. It must also be noted that
for only five rules (one per membership function), the error obtained is
notably low. If we compare our system with a TSK fuzzy system with the same
number of rules and constant consequents, we see that the error of the latter
(NRMSE = 0.3493) is very high compared to that of our Taylor Series Expansion
based fuzzy system (NRMSE = 0.0154).

It should be remembered that the Normalized Root-Mean Square Error
(NRMSE) is defined as:

$$\mathrm{NRMSE} = \sqrt{\frac{e^2}{\sigma_y^2}} \tag{17}$$

where $\sigma_y^2$ is the variance of the output data, and $e^2$ is the
mean-square error between the system and the dataset $D$ output.

Comparing also with the same number of parameters, that is, 5 rules for our
system and 15 rules for a TSK system with constant consequents, we observe that
the error obtained by our Taylor-based rule system (NRMSE = 0.0154) is much
lower than that of the constant-consequent rule system (NRMSE = 0.0635).



6 Conclusions 

In this paper we have presented an approach to the problem of function
approximation from a set of I/O points using a special type of fuzzy system.
Using an Orderly Local Membership Function Basis (OLMF) and Taylor Series-Based
Fuzzy Rules, the proposed fuzzy system has the property that the Taylor Series
Expansion of the defuzzified function around each rule centre coincides with
that rule’s consequent. This endows the proposed system with both the
approximating capabilities of TSK fuzzy rules through the use of the OLS
algorithm, and the interpretability advantages of pure fuzzy systems.

Abstract. When planning and designing a policy intervention and
evaluation, the policy maker will have to define a strategy which will 
define the (conditional independence) structure of the available data. 
Here, Dawid’s extended influence diagrams are augmented by including 
‘experimental design’ decisions nodes within the set of intervention 
strategies to provide semantics to discuss how a ‘design’ decision strat- 
egy (such as randomisation) might assist the systematic identification 
of intervention causal effects. By introducing design decision nodes into 
the framework, the experimental design underlying the data available 
is made explicit. We show how influence diagrams might be used to 
discuss the efficacy of different designs and conditions under which 
one can identify ‘causal’ effects of a future policy intervention. The 
approach of this paper lies primarily within probabilistic decision theory. 

Keywords: Causal inference; Influence diagrams; Design interventions 
and strategies; Identification of Policy Effects; Directed Acyclic Graphs 
(DAGs); Confounders; Bayesian decision theory. 



1 Introduction 

Intervention has to do with ‘perturbing’ the dynamics of a system. We say that
a system consists of components which influence each other, and that its
dynamics describe the way these components interrelate with each other in an
equilibrium state; an example could be the road traffic in a town. The system
at present has some pre-intervention dynamics or interactions attached to it.
When we intervene in a system, say by adding a red light at a corner, we
introduce a new component into the system that will imply new
post-intervention dynamics. The intervention might have both qualitative
effects, modifying the structure of the system (maybe by ‘blocking’ the
interaction between two of its components), and quantitative effects, modifying
the value of the components. One of the main interests consists in describing
whether and how the intervention is affecting the system. So, an evaluation of
the intervention effects is required, and it is usually measured in terms of a
response variable, such as the number of accidents.

Experimental design decisions are usually made in order to assist the isolation
of the intervention (causal) effects. Randomised allocation of treatments to
units is a well-known practice within medical clinical trials but, because of
ethical, social and financial considerations, complete randomisation within an
experiment designed to evaluate a social policy will usually be infeasibly
costly. Therefore, knowing the details of the policy assignment mechanism and a
well-planned recording of the data (choosing variables to be observed, perhaps
through a survey) become very relevant issues in order to have all the
information needed to measure the right ‘causal’ effects (see Rubin (1978)).
Implementation of experimental designs and the recording mechanism will have a
cost associated with them, and policy makers-evaluators will question whether
it is worth spending a certain amount of money to implement a ‘proper’ design
of experiments.

* This work was partially funded by CONACYT.

Influence diagrams (IDs) will be used to represent the system dynamics and 
interventions graphically. Our interpretation of causal effects for interventions 
in policy programmes will follow Dawid’s approach and be Bayesian decision- 
theoretic [3]. By including ‘experimental design’ decisions in what we call a 
Design Network (DN), in this paper, we maintain that experimental design de- 
cisions are intrinsic to any analysis of policy intervention strategies. We discuss 
when we can evaluate the causal effect of a class of policy intervention strategies 
under a design decision strategy. 

2 Intervention Graphical Framework 

Influence diagrams have been used for over 20 years to form the framework for 
both describing [see Howard and Matheson (1984), Shachter(1986), Smith(1989), 
Oliver and Smith(1990)] and also devising efficient algorithms to calculate effects 
of decisions (See Jensen(2001)) in complex systems which implicitly embody 
strong conditional independence assertions. However, it is only recently that they 
have been used to explain causal relationships (Dawid(2000), Dawid(2002)) in 
the sense described above, and shown to be much more versatile than Causal 
Bayesian Networks (Pearl (2000)). 

In our context, every decision that we make when we are planning and de- 
signing an (experimental) study has an effect on the structure of the data that 
we are to collect. Such decisions can be included in the graphical representation 
of the dynamics of the system using IDs, and the structure of the data avail- 
able to do the evaluation of the policy will be defined by the set of conditional 
independencies that are derived from the graph. 

Definition 1. If $X, Y, Z$ are random variables with a joint distribution
$P(\cdot)$, we say that $X$ is conditionally independent of $Y$ given $Z$
under $P$ if, for any possible pair of values $(y, z)$ for $(Y, Z)$ such that
$p(y, z) > 0$, $P(x \mid y, z) = P(x \mid z)$. Following the notation of Dawid
(1979) [1], this can be written as $(X \perp\!\!\!\perp Y \mid Z)_P$.
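Definition 1 can be checked numerically on a small discrete example. The sketch below is illustrative only: the binary variables and all probability values are hypothetical, and the joint distribution is constructed so that the conditional independence holds by design; a perturbed copy is then used as a counterexample.

```python
import itertools
import numpy as np

# Hypothetical joint distribution over binary (X, Y, Z), built so that
# X and Y are independent given Z: p(x, y, z) = p(z) p(x | z) p(y | z).
p_z = np.array([0.4, 0.6])
p_x_given_z = np.array([[0.9, 0.1], [0.3, 0.7]])   # rows: z, columns: x
p_y_given_z = np.array([[0.2, 0.8], [0.5, 0.5]])   # rows: z, columns: y

joint = np.zeros((2, 2, 2))                        # indexed as joint[x, y, z]
for xv, yv, zv in itertools.product(range(2), repeat=3):
    joint[xv, yv, zv] = p_z[zv] * p_x_given_z[zv, xv] * p_y_given_z[zv, yv]

def is_cond_independent(joint, tol=1e-12):
    """Check P(x | y, z) == P(x | z) wherever p(y, z) > 0."""
    p_yz = joint.sum(axis=0)                       # p(y, z)
    p_xz = joint.sum(axis=1)                       # p(x, z)
    p_z_marg = p_xz.sum(axis=0)                    # p(z)
    for xv, yv, zv in itertools.product(range(2), repeat=3):
        if p_yz[yv, zv] > 0:
            lhs = joint[xv, yv, zv] / p_yz[yv, zv]
            rhs = p_xz[xv, zv] / p_z_marg[zv]
            if abs(lhs - rhs) > tol:
                return False
    return True

# Counterexample: move a little mass between two cells with the same (x, z),
# which leaves p(x, z) unchanged but makes X depend on Y given Z.
dependent = joint.copy()
dependent[0, 0, 0] += 0.01
dependent[0, 1, 0] -= 0.01

print(is_cond_independent(joint), is_cond_independent(dependent))
```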

Dawid (2002) points out that, traditionally, in IDs conditional distributions
are given for random nodes, but no description is supplied of the functions or
distributions involved at the decision nodes, which are left arbitrary and at
the choice of the decision maker. If we choose to provide some description of
the decision rules, then any given specification of the functions or
distributions at decision nodes constitutes a decision strategy, $\pi$.
Decisions taken determine what we may term the partial distribution, $p$, of
random nodes given decision nodes. If $E$ and $D$ denote the sets of random
events and decisions involved in decision strategy $\pi$, respectively, then
the full joint specification formed by decision strategy $\pi$ and partial
distribution of random nodes $p$, for all $e \in E$ and $d \in D$, is given by:
$(\text{random}, \text{decision}) = p_\pi(e, d) = p(e : d)\,\pi(d : e)$.

The graphical representation of the full joint specification $p_\pi$ can be
done by using extended IDs (see Dawid (2002)) that incorporate non-random
parameter nodes ($\theta_e$) and strategy nodes ($\pi_d$) representing the
‘mechanisms’ that generate random and decision nodes, respectively (i.e.
$\theta_e = p(e \mid pa^\circ(e))$ and $\pi_d = \pi(d \mid pa^\circ(d))$). In
what he calls augmented DAGs, Dawid (2002) incorporates ‘intervention nodes’
$F$ for each of the variables in the influence diagram, where $F_X = x$
corresponds to ‘setting’ the value of node $X$ to $x$. The conditional
distribution of $X$ given $F_X = x$ will be degenerate at the value $x$ (i.e.
$P(X = x \mid F_X = x) = 1$), and he introduces a new value $\emptyset$ such
that when $F_X = \emptyset$, $X$ is left to have its ‘natural’ distribution,
named by Pearl as the ‘idle’ system. Action $F$, in Pearl’s language [10], will
correspond to $F_X = do(X = x)$. Augmented DAGs provide a very useful framework
to show the differences between observation and intervention because they make
explicit the structure where intervention consists in setting or ‘doing’ values
of variables. Also, as the specification of $F$ has to be done externally to
the graph, the structure of the augmented DAG can be kept untouched for other
types of specification of $F$. Figure 1 shows, for a simple case, the usual
representation of IDs as well as its extended and augmented versions, for the
set $(T, B, Y)$, where $T = (T_1, T_2, \dots, T_g)$ represents a set of policy
variables (treatment), $B = (B_1, B_2, \dots, B_r)$ is a set of background
variables (potential confounders depending on their observability and the
recording mechanism followed) and $Y$ is a response variable.




Fig. 1. Extended influence diagrams and augmented DAGs: (a) influence diagram;
(b) extended ID; (c) augmented DAG.



For causal reasoning, it is said that two augmented DAGs will represent
equivalent causal assumptions if they have the same ‘skeleton’ and the same
‘immoralities’ [3]. Causal enquiries about the ‘effect of $T$ on $Y$’ are
regarded as relating to (comparisons between) the distributions of $Y$ given
$F_T = do(T = t')$ for various settings of $t'$. The intervention node $F$ of
the augmented DAG is used to discuss the identifiability of atomic intervention
(policy) effects under certain DAG structures and recording mechanisms of $B$,
$R(B)$. Different structures and examples of identifiable and unidentifiable
situations have been discussed by Pearl (2000), Lauritzen (2001) and Dawid
(2002), each of them with their particular framework and notation.

Definition 2. A DAG on variables $\{X_1, \dots, X_k\}$ of a probability function
$p(x)$ which factorises as
$p(x_1, \dots, x_k) = p(x_1) \prod_{i=2}^{k} p(x_i \mid pa_{x_i})$, where
$pa_{x_i} \subseteq \{x_1, \dots, x_{i-1}\}$, $i = 2, \dots, k$, is a directed
acyclic graph with nodes $\{x_1, \dots, x_k\}$ and a directed edge from $x_j$
to $x_i$ iff $x_j \in pa_{x_i}$.

It cannot be asserted in general that the effect of setting the value of $X$ to
$x'$ is the same as the effect of observing $X = x'$. Only in limited
circumstances (as when the node for $X$ has no parents in the graph) will they
coincide. Pearl asserts that, if the ID corresponds to a causal Bayes net, then
the intervention $F_{X_i} = do(X_i = x_i')$ on a node $X_i$ transforms the
original joint probability function $p(x_1, \dots, x_k)$ into a new probability
given by

$$p(x_1, \dots, x_k \mid F_{X_i} = do(X_i = x_i')) = \begin{cases} \dfrac{p(x_1, \dots, x_k)}{p(x_i' \mid pa_{x_i})} & \text{if } x_i = x_i' \\ 0 & \text{if } x_i \neq x_i' \end{cases}$$

On the other hand, if we were observing naturally $X_i = x_i'$, the probability
distribution conditional ‘by observation’ can be obtained by the usual rules of
probability following Bayes’ theorem, such that, for $x_i = x_i'$,

$$p(x_1, \dots, x_k \mid X_i = x_i') = \frac{p(x_1, \dots, x_k)}{p(x_i')}$$
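The difference between the two formulas can be made concrete on a three-node example. The sketch below is a minimal illustration, assuming a hypothetical causal Bayes net on binary variables with edges B -> T, B -> Y and T -> Y; the graph and every probability value are invented for the example. Intervening divides the joint by $p(t \mid b)$, the conditional of $T$ given its parent, whereas observing divides it by the marginal $p(t)$.

```python
# Hypothetical causal Bayes net on binary variables with edges
# B -> T, B -> Y and T -> Y; all numbers are illustrative.
p_b = {0: 0.7, 1: 0.3}
p_t_given_b = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.1, 1: 0.9}}   # p_t_given_b[b][t]
p_y1_given_tb = {(0, 0): 0.1, (0, 1): 0.5,
                 (1, 0): 0.3, (1, 1): 0.8}                  # P(Y=1 | t, b)

def joint(b, t, y):
    """p(b, t, y) from the DAG factorisation."""
    p_y1 = p_y1_given_tb[(t, b)]
    return p_b[b] * p_t_given_b[b][t] * (p_y1 if y == 1 else 1.0 - p_y1)

def p_y1_observe(t):
    """P(Y = 1 | T = t): condition on T with the usual rules of probability."""
    num = sum(joint(b, t, 1) for b in (0, 1))
    den = sum(joint(b, t, y) for b in (0, 1) for y in (0, 1))
    return num / den

def p_y1_do(t):
    """P(Y = 1 | do(T = t)): divide the joint by p(t | pa_T) = p(t | b)."""
    return sum(joint(b, t, 1) / p_t_given_b[b][t] for b in (0, 1))

print(f"observing T = 1: {p_y1_observe(1):.4f}")
print(f"setting   T = 1: {p_y1_do(1):.4f}")
```

In this example the two quantities differ because B influences both T and Y; if T had no parents, the two functions would return the same value.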

3 Policy versus Experimental Decisions 

When planning and designing a policy intervention and evaluation, the policy
maker will have to define a strategy that involves ‘policy intervention’
actions and ‘(pilot study) experimental design’ actions. The former will
include decisions relating to how the policy will be implemented (which doses
and to whom they will be provided). The latter is related to the evaluation of
the policy and will include some experimental design decisions that will define
the (chosen or controlled) conditions under which the study will be carried out
and the data recorded. Both strategies define the (conditional independence)
structure of the available data through the decision strategy, $\pi$, adopted.
A decision strategy is made up of a set of decisions or components. So, if
$d_1, d_2, \dots, d_u$ are the components of a particular decision strategy
$\pi_D$, then we are interested in describing
$\pi_D(d_1, d_2, \dots, d_u \mid E)$. The set $D = \{d_1, d_2, \dots, d_u\}$
will contain ‘policy intervention’ decisions and ‘experimental’ decisions, and
one could define two subsets of $D$ such that
$D_T = \{d' : d' \text{ is a policy intervention}\}$ and
$D_E = \{d^* : d^* \text{ is experimental}\}$.

Experimental design decision interventions, $D_E$, set the conditions under
which (future) policy interventions will be evaluated. Design decisions involve
treatment assignment mechanisms $A$ and recording mechanisms $R(B)$, so that
$D_E = \{A, R(B)\}$. As Rubin (1978) pointed out, causal inference might be
sensitive to the specification of these mechanisms or strategies. Policy
assignment mechanisms define conditional independence of the data and
structures in the ‘experimental’ graph, including all observed and unobserved
nodes. Recording mechanisms determine which of the nodes are observed and
available to us in the data. Both the treatment assignments used in the
experiment and the variables available to us influence our evaluation of causal
effects when we want to evaluate $F_T$. A broader definition of experimental
design decisions could include sampling mechanisms and eligibility criteria in
$D_E$ (see Madrigal (2004)).

The way sets of factors are controlled (like randomisation and stratification)
will have a qualitative effect on the conditional independence structure that
the data, once recorded, will have under a given policy design strategy. Within
this, each strategy will also need to quantify the levels or values at which
these factors are set, e.g. the doses of treatments defined. Thus, setting the
value of the doses to 1 or 2 may well have a different quantitative effect on
the response, but the basic structure may not be altered by this value. Most of
the emphasis in practical statistical discussions is usually on the effects of
these ‘values’, rather than on the structural consequences of the former, which
are the focus of this paper.

When we, as ‘data-collectors’, approach the world, the data we observe will
depend on our way of approaching it. Any dataset obtained will have been
influenced by the way it was collected, so it should always be conditioned on
the strategy followed in its collection. The effect each decision strategy
might have on the ‘available-data’ structure can be qualitatively different and
affect the partial distribution $p(E : D)$ in different ways. Whatever the
decision strategy followed, the data we ‘observe’ in the database (available
data) is a representation of the partial distribution of the random nodes given
the set of decisions (made or deliberately ‘not made’). The joint partial
distributions for observational and experimental data are given by
$p(E : D_E = \emptyset)$ and $p(E : D_E = d_E)$, respectively. In this sense,
observational data is a special type of experiment.

So, it is important that design actions and analysis assumptions are com- 
patible. Setting decision strategies has as an output: ‘a structure on the data 
available’, and the ideal will be to analyse it by using statistical models that 
are appropriate to deal with such a structure. Within our graphical framework 
this requires there to be an identifiable map from the estimated experimental 
relationships (indexed by edges in the experimental graph) to the relationship 
of interest (indexed by edges in the target policy graph). 



4 Introducing Experimental Nodes 

Experimental design interventions, $D_E = \{A, R(B)\}$, will set the conditions
under which (future) policy interventions $F_T$ will be evaluated. These
experimental actions define the recording and policy assignment mechanisms,
which could involve complex strategies like stratified-cluster-randomisation or
some contingent-allocation rules. Similarly, policy interventions, $D_T$, might
imply (a collection of) atomic, contingent and randomised actions.

Consider, as in Section 2, the set $(T, B, Y)$, where
$T = (T_1, T_2, \dots, T_g)$ represents a set of policy variables (treatment),
$B = (B_1, B_2, \dots, B_r)$ is a set of background variables and $Y$ is a
response variable. As before, let $F_T$ be the future policy intervention to be
evaluated. For simplicity, suppose that $T$ and $Y$ are univariate; $B$ does
not contain intermediate variables between $T$ and $Y$ (i.e. $B$ consists of
pre-intervention variables that will not be affected by $T$); and the
univariate future policy consists of atomic interventions $F_T = do(T = t')$.

As discussed in Section 3, let $D_E = \{A, R(B)\}$, where $A$ contains all the
policy assignment mechanisms and $R(B)$ contains the recording mechanism, such
that $R(B_i) = 1$, for $i = 1, 2, \dots, r$, if $B_i$ is recorded and
$R(B_i) = 0$ if $B_i$ is either unobservable or not recorded. Assignment nodes
$A$ and recording nodes $R$ can be included in the DAG as decision nodes to
create a design network (DN). Figure 2 shows a design network used for the
evaluation of the future policy $F_T = do(T = t')$ under 3 possible scenarios.
Note that $A$ blocks all the paths going from $B$ to the policy node $T$. This
follows from the assumption that $A$ captures all the allocation mechanisms for
$T$ that might be influenced by the background variables $B$, so that $A$ is
the only parent of the policy node $T$ (besides the future intervention node
$F_T$) in the design network. Recording nodes, $R(B)$, are added for each
background variable $B_i$, introducing the decision of recording $B_i$ versus
not recording it. A double circle containing a dashed and a solid line is given
to each background node $B_i$ to show its potential observability. It is
assumed that policy $T$ and response variable $Y$ will be recorded.




Fig. 2. Design network for the evaluation of future policy intervention $F_T$:
(a) $B$ irrelevant; (b) $B$ white noise; (c) $B$ potential confounder.

Figures 2(a) and 2(b) show cases where the background variables are irrelevant
or represent white noise (respectively) for $Y$. In both cases, the
non-confounding condition, $Y \perp\!\!\!\perp F_T \mid T$ (see Dawid 2002,
section 7), holds and the future (atomic) policy intervention can be identified
directly from the data available using
$P(y \mid t', F_T = do(T = t')) = P(y \mid t', F_T = \emptyset)$, regardless of
the knowledge we have about the policy assignment and recording mechanisms. A
more interesting case to discuss experimental design effects is given by 2(c),
when a potential confounder is present and $Y \perp\!\!\!\perp F_T \mid T$ does
not hold. By introducing assignment nodes $A$, we can deduce from the design
network that $Y \perp\!\!\!\perp F_T \mid (T, A)$ and $A \perp\!\!\!\perp F_T$
hold. The latter condition implies that the future policy intervention
$F_T = do(T = t')$ will be done independently of the experimental design
conditions chosen to allocate treatments, $A$. The former implies that, given a
known policy assignment mechanism $A = a^*$ and the value of the future policy
intervention $T = t'$, learning whether $t'$ arose from that assignment
mechanism ($F_T = \emptyset$) or was set externally ($F_T = do(T = t')$) is
irrelevant for the distribution of $y$, thus implying
$P(y \mid t', F_T = do(T = t'), A = a^*) = P(y \mid t', F_T = \emptyset, A = a^*)$.

As the policy assignment mechanisms may involve a collection of actions, the
node $A$ might be expanded to show explicitly the mechanism underlying the
assignment. This expansion will typically involve parameter and intervention
nodes that are included in an augmented-extended version of the design network.
Assignment actions $A$ might have an effect on the original collection of
conditional independence statements, $S_O$, and new conditional independencies,
$S_E$, which can be derived from the experimental DAG, will be obtained when
action $A = a^*$ is taken, thus changing the structure of the original DAG.
Recording decisions will have an effect on the set of variables that will be
available to us through the available (sic) experimental data. Thus $R(B)$ will
not introduce any new (in)dependencies in the structure, but will be relevant
when discussing potential identifiability of effects given assignment actions
$A$ and the set of conditional independencies in $S_E$.

The simple case of ‘pure’ (i.e. non-stratified) individual random allocation
(contrasted with the ‘no experiment’ decision, i.e. observational data) will be
used to introduce this procedure. When the policy assignment is done through
random allocation, two control actions are performed: randomisation and
intervention. So treatment $t^*$ is set, $A_T = do(T = t^*)$, according to a
probability distribution $\theta_T$ totally fixed and controlled by the
experimenter through $A_{\theta_T} = do(\theta_T = \theta_T^*)$. Figure 3(a)
shows the augmented-extended design network where node $A$ from Figure 2(c) is
expanded to include these two actions. Nodes $A_{\theta_T}$ and $A_T$ stand for
atomic interventions and follow the same degenerate distributions as defined
for $F$. The parameter node $\theta_T$ denotes the probability distribution
used to allocate the treatments. When no randomisation is done, so that
$A_{\theta_T} = \emptyset$, $\theta_T$ is left to vary ‘naturally’ according to
$\theta_T = p(A_T \mid B)$ (thus capturing the possibly unobserved effect of
background variables on the choice of policy that will be made). When pure
random allocation is done, $\theta_T$ will be set to
$\theta_T = \theta_T^* = p^*(A_T)$, and by doing($\theta_T = \theta_T^*$) the
link from $B$ to $\theta_T$ is broken, making explicit that the probability for
the allocation does not depend on any background variable $B$, so
$\theta_T^* \perp\!\!\!\perp B$. This $\theta_T^*$ will correspond to the
probability of ‘observing’ $T = t^*$ in the experimental available data. Node
$A_T$ is used to emphasise the fact that the policy intervention value, $t^*$,
is set externally by the experimenter. Although $A_T$ and $F_T$ have the same
form, $A_T = do(T = t^*)$ represents an action that was done according to
$\theta_T^*$ (in the past, when allocating policy) and that defines the
structure of the data available (in the present), while $F_T$ will be used to
represent the atomic action of a (future) intervention $F_T = do(T = t')$ to
address identifiability of causal effects $P(Y \mid do(T = t'), \cdot)$. In
general, the set $\{t^*\}$ of policy-intervention values assigned in the
experiment through $A_T$ is not necessarily the same as the set of
future-policy values $\{t'\}$ defined by future policy $F_T$. However, if
$\{t'\} \subseteq \{t^*\}$, the positivity condition (see Dawid 2002) is
satisfied. When this is not the case, some parametric assumptions need to be
made for $P(Y \mid T, \cdot)$ before the policy effects can be identified from
the data available.

Although all the information required is contained in the augmented-extended
design network, an experimental augmented ID is shown in Figure 3 to make
explicit the different structures of the (experimental) data available obtained
for both situations, namely when random allocation is performed, 3(b), and when
no random allocation is carried out, 3(c).

Fig. 3. Augmented-extended design network and experimental influence diagrams:
(a) augmented-extended design network for random allocation; (b) experimental
ID for evaluation of future policy $F_T$ under $A_{\theta_T} = \theta_T^*$;
(c) experimental ID for evaluation of future policy $F_T$ under
$A_{\theta_T} = \emptyset$.

From Figure 3(b), we can see that after random allocation is performed, the
conditional independencies $(T \perp\!\!\!\perp B)_E$ and
$(Y \perp\!\!\!\perp F_T \mid T)_E$ are additional to the independencies in the
original observational graph $O$ included in $S_O$. The structure of this
experimental graph is the same as the one in Figure 2(b), and it is easily
shown that the causal effect is directly identified by
$P(y \mid t', F_T = do(T = t'), A_{\theta_T} = do(\theta_T = \theta_T^*)) =
P(y \mid t', F_T = \emptyset, A_{\theta_T} = do(\theta_T = \theta_T^*))$. In
this case, the recording or not of $B$, $R(B)$, is irrelevant for both
identifiability and the type of effect measured. From 3(c), we can see that if
no random allocation is done and it is decided to use an observational study by
setting $A_{\theta_T} = \emptyset$, no ‘extra’ conditional independencies are
gained. In this case, identifiability of the average effect will depend on the
recording of $B$. Thus, if $R(B) = 1$ and $B$ is recorded in the data available,
the policy intervention effect on $Y$ could be estimated using the ‘back-door
formula’ (see Pearl (2000), Lauritzen (2001) and Dawid (2002)), such that
$P(y \mid F_T = do(T = t'), A_{\theta_T} = \emptyset, R(B) = 1) =
\int_B P(y \mid t', B) P(B)\, dB$. However, if $R(B) = 0$ and $B$ is not
recorded, the treatment effect on $Y$ is not identifiable from the available
data.
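The contrast between the two designs can be illustrated with a small simulation. This is a sketch under assumed conditions, not an analysis from the paper: all variables are binary, the data-generating probabilities, the sample size and the seed are hypothetical, and the true value of $P(Y = 1 \mid do(T = 1))$ in this toy model is $0.1 + 0.2 + 0.4 \times 0.3 = 0.42$.

```python
import numpy as np

rng = np.random.default_rng(1)           # arbitrary seed (assumption)
n = 200_000                              # illustrative sample size

def simulate(randomised):
    """Draw (B, T, Y); B confounds T and Y unless T is randomised."""
    b = rng.random(n) < 0.3
    p_t = np.full(n, 0.5) if randomised else np.where(b, 0.9, 0.2)
    t = rng.random(n) < p_t
    p_y = 0.1 + 0.2 * t + 0.4 * b        # hypothetical P(Y = 1 | t, b)
    y = rng.random(n) < p_y
    return b, t, y

def naive(t, y, t_val):
    """P(y | T = t_val) read directly from the data."""
    return y[t == t_val].mean()

def back_door(b, t, y, t_val):
    """Sum over b of P(y | t_val, b) P(b), which needs B to be recorded."""
    return sum(y[(t == t_val) & (b == bv)].mean() * (b == bv).mean()
               for bv in (False, True))

b_o, t_o, y_o = simulate(randomised=False)   # observational design
b_r, t_r, y_r = simulate(randomised=True)    # pure random allocation

est_random = naive(t_r, y_r, True)
est_naive = naive(t_o, y_o, True)
est_adjusted = back_door(b_o, t_o, y_o, True)
print(f"randomised, direct:       {est_random:.3f}")
print(f"observational, naive:     {est_naive:.3f}")
print(f"observational, back-door: {est_adjusted:.3f}")
```

Under random allocation the direct estimate does not use B at all, which mirrors the statement that recording B is irrelevant in that case; under the observational design the adjusted estimate is only available when B has been recorded.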

In general, we will say that the ‘causal’ effect of $T$ on $Y$ is identifiable
directly from available (experimental) data collected under $D_E = d_E$ if
learning the value of $F_T$ (i.e. learning whether the future policy was set to
a value or left to vary ‘naturally’) does not provide any ‘extra’ information
about the response variable $Y$ given the value of $T$ and experimental
conditions $D_E = d_E$ (i.e. if $Y \perp\!\!\!\perp F_T \mid T, D_E$); then
$P(y \mid t', F_T = do(T = t'), D_E = d_E) =
P(y \mid t', F_T = \emptyset, D_E = d_E)$. Note that this will hold (or not)
regardless of $R(B)$. However, when identifiability cannot be obtained directly
from data defined by $D_E$, identifiability can still hold for a particular
configuration of $R(B)$. Then, we say that the causal effect is identifiable
through an ‘adjustment’ procedure if
$P(y \mid t', F_T = do(T = t'), D_E = d_E) = h(y, t', B^* \mid D_E = d_E)$
such that $R(B_i) = 1$ for all $B_i \in B^* \subseteq B$, and $h$ is a function
of known probabilistic distributions of recorded variables under $d_E$. In this
latter case, the design $d_E$ is said to be ignorable (see Rosenbaum and Rubin
(1983) and Rubin (1978)).

The efficacy of experimental design interventions $D_E$ could then be measured 
in terms of making the (causal) effects of $F_T = do(T = t')$ identifiable, and 
two (or more) experiments can then be compared in these terms. A general algorithm 
for the evaluation of design decisions $D_E$ using design networks is: 

1. Build the complete ID including all influences between policy nodes, 
background variables (observable and unobservable) and the response (e.g. Fig. 1(a)). 

2. Construct the (extended) design network, adding all potential treatment 
assignment mechanisms $A$, recording mechanisms $R(B)$, and the future policy to be 
evaluated $F$ (e.g. Fig. 3(a)). 

3. For each potential treatment assignment mechanism $A$ construct its 
corresponding experimental ID (e.g. Fig. 3(b) and 3(c)). 

4. Using $S_E$ and identifiability conditions, construct a set of possible design 
consequences (in terms of (type of) identifiability of $F_T = do(T = t')$) for each 
different set of experimental decisions $D_E = \{A, R(B)\}$. 

5. Define some utility function over the set of design consequences. 

6. Among the experimental decisions $D_E$, choose the one with highest utility. 

To illustrate steps 4-6, imagine we could establish that the utilities associated 
with obtaining direct identifiability, adjusted (back-door) identifiability 
and unidentifiability are given by $U_D$, $U_A$ and $U_U$, respectively. Then for the 
pure random allocation vs. observational case we have the following for the four 
possible combinations of $(A, R(B))$: 





        Experimental Decisions                      Design Consequence          Utility 
$D_E$   $A_{\theta_T}$                   $R(B)$     $P(y \mid F_T = t', D_E)$     $U$ 
1       $do(\theta_T = \theta_T^*)$      1          direct identifiability      $U_D$ 
2       $do(\theta_T = \theta_T^*)$      0          direct identifiability      $U_D$ 
3       $A_{\theta_T} = \emptyset$       1          adjusted identifiability    $U_A$ 
4       $A_{\theta_T} = \emptyset$       0          not identifiable            $U_U$ 



Both experimental decisions that include random allocation, 
$A_{\theta_T} = do(\theta_T = \theta_T^*)$, have the same utility associated in terms of 
identifiability and are equivalent in these terms. However, performing an experiment 
(randomising and/or recording) will typically have an associated cost that should be 
included in the utility for a complete evaluation of the designs. Making a distinction 
between the utility associated with direct and adjusted identifiability might sound 
suspicious. However, although policy effects can be identified in both situations, some 
maximum likelihood estimates will not generally have the same efficiency (see 
Lauritzen (2001)). A broader discussion can be found in Madrigal (2004). 
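
Steps 4-6 can be sketched in a few lines of Python. This is only an illustration 
under stated assumptions: the numeric utilities and the costs of randomising and of 
recording B are invented for the example, and the names `UTILITY`, `COST`, 
`consequence` and `net_utility` are hypothetical. Only the mapping from (A, R(B)) to 
the design consequence follows the four-row table above. 

```python
from itertools import product

# Hypothetical utilities for each design consequence (illustrative numbers only).
UTILITY = {"direct": 1.0, "adjusted": 0.9, "unidentifiable": 0.0}
# Hypothetical costs of performing random allocation and of recording B.
COST = {"randomise": 0.25, "record_B": 0.05}


def consequence(randomised: bool, records_b: bool) -> str:
    """Design consequence for one pair (A, R(B)), as in the four-row table."""
    if randomised:
        return "direct"  # R(B) is irrelevant under random allocation
    return "adjusted" if records_b else "unidentifiable"


def net_utility(randomised: bool, records_b: bool) -> float:
    """Utility of the consequence minus the (assumed) cost of the design."""
    cost = COST["randomise"] * randomised + COST["record_B"] * records_b
    return UTILITY[consequence(randomised, records_b)] - cost


# Step 6: enumerate the four designs and keep the one with highest net utility.
designs = list(product([True, False], [True, False]))
best = max(designs, key=lambda d: net_utility(*d))
```

With these invented numbers the observational design that records B wins, because 
the assumed cost of randomising outweighs the assumed gap between direct and adjusted 
identifiability; different costs would change the ranking. 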

5 Conclusions 

Design networks and experimental influence diagrams were introduced by explicitly 
incorporating nodes for policy assignment $A$ and recording mechanisms $R$. This 
proves very useful for showing clearly the consequences of experimental design 
interventions in the graph. In particular, certain policy assignment mechanisms, 
such as randomised allocation, will add ‘extra’ independencies to the ID, defining a 
new collection of conditional independencies. The relevance of $A$ to assist 
identification and the equivalence of two different mechanisms $A_1$ and $A_2$ in terms 
of identifiability can be addressed under this framework for diverse types of 
assignment (for a broader discussion see Madrigal (2004)). Recording mechanisms 
prove to be relevant to assist non-direct identification of effects. Identifiability 
of effects, although a very important issue, is not the only thing needed when we 
face the inferential problem. The general idea was introduced for ‘pure’ random 
allocation of policies in this paper; extensions of this to stratified and clustered 
assignments can be found in Madrigal (2004). 
Abstract. Bayes-N is an algorithm for Bayesian network learning from data 
based on local measures of information gain, applied to problems in which there 
is a given dependent or class variable and a set of independent or explanatory 
variables from which we want to predict the class variable on new cases. Given 
this setting, Bayes-N induces an ancestral ordering of all the variables, 
generating a directed acyclic graph in which the class variable is a sink variable, 
with a subset of the explanatory variables as its parents. It is shown that 
classification using these variables as predictors performs better than the naive 
Bayes classifier, and at least as well as other algorithms that learn Bayesian 
networks such as K2, PC and Bayes9. It is also shown that the MDL measure 
of the networks generated by Bayes-N is comparable to those obtained by these 
other algorithms. 

Keywords: Bayesian networks, data mining, classification, MDL, machine 
learning. 



1 Introduction 

There are several problems in scientific research in which there is a particular 
structure in the relationship between the variables under consideration. In problems of 
this type there is a dependent or response variable, and a set of independent or 
explanatory variables (for example in experimental design, regression models, 
prediction, classification problems, and so on). The underlying hypothesis is that the 
behaviour of the dependent variable can be explained as a result of the action of the 
independent variables. Although in some contexts a causal relationship is assumed, 
usually the assumption is restricted to state that there is some sort of association or 
covariation between variables. Probabilistic models (discrete, continuous or mixed) 
are applied when uncertainty is present. Bayesian networks [1][2][3] provide 
appropriate models in many such cases, particularly when we are looking not only 
for the structure of the relationship, as in the case of learning the network topology, 
but also when we want to use the model as an inference machine, for example in 
classification problems. The application of Bayesian networks to this kind of 
problem is not new [4], but there are several issues to be addressed, in particular 
those related to learning Bayesian networks from data and assessing their 
performance when applied to classification problems. 

In this paper we present Bayes-N, an algorithm for learning a Bayesian network 
from data that takes advantage of the asymmetric relationship (dependent vs. 
independent discrete variables) in problems of the type just described, 
generating an ancestral ordering among variables based on local measures of 
information gain [5] [6] [7] [8] and controlled statistical tests of conditional 
independence. We show that the algorithm performs well compared with similar 
algorithms, and that for classification problems its performance is better than the 
naive Bayes classifier [4] and at least as good as that of K2 [9], PC [7] and Bayes9 
[8]. Also, the MDL measure of goodness of fit of the network to the data from which it 
was obtained is comparable to the MDL of networks obtained by other well-known 
algorithms such as the ones just mentioned. 



2 A Statistical Test for Conditional Independence 

In 1959 S. Kullback proposed a statistical test for conditional independence based on 
information measures [10] which was later generalized [11] and has been widely used 
in association with loglinear models [12]. It has also been used in the context of 
undirected graphical models related to multivariate statistical analysis [13]. Here, a 
simple variant of the test is applied to induce the structure of directed acyclic graphs 
as explained below. For the case of testing marginal independence for two discrete 
variables, Kullback’s test is equivalent to the usual chi-square test. 

Kullback’s Test for Conditional Independence. Let X, Y, Z be discrete random 
variables with given joint, conditional and marginal probability distributions. The 
information gain on X given by Y is defined as $I(X|Y) = H(X) - H(X|Y)$, and the 
conditional information on X given Y and Z is defined by 
$I(X|Z,Y) = H(X|Z) - H(X|Z,Y)$. $H(X)$, $H(X|Y)$ and $H(X|Z,Y)$ are the entropy and 
the conditional entropies defined as usual [14]. 

By $X \perp Y$ we mean X and Y are marginally independent. By $X \perp Y \mid Z$ we mean X 
is conditionally independent of Y given Z. 

A test for marginal independence between X and Y can be set up in terms of 
information gain measures as follows: 

$H_0: I(X|Y) = 0 \quad (X \perp Y)$                                            (1) 

$H_1: I(X|Y) > 0 \quad (X \not\perp Y)$ 

Let K(X|Y) denote the estimator of I(X|Y) obtained from the sample, and let N be 
the sample size. Kullback shows that under $H_0$, the statistic T = 2NK(X|Y) is 
asymptotically distributed as a chi-square variable (with appropriate degrees of 
freedom). Under $H_1$, T will have a non-central chi-square distribution. Thus we can 
perform a test of independence based on K(X|Y). In fact T is closely related to the 
statistic frequently used in tests related to loglinear and graphical models [12][15]. 
T is used here to emphasize the particular application we have in mind. 
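
As a minimal sketch of the marginal test (not the authors' implementation), the 
statistic T = 2NK(X|Y) can be computed from two columns of discrete data as below. 
The function names are hypothetical. Natural logarithms are used so that T is on the 
chi-square scale, and the degrees of freedom (r-1)(c-1) are the usual choice for a 
two-way table, which is an assumption here since the text only says "appropriate 
degrees of freedom". 

```python
from collections import Counter
from math import log


def entropy(values):
    """Empirical Shannon entropy (in nats) of a sequence of discrete values."""
    n = len(values)
    return -sum(c / n * log(c / n) for c in Counter(values).values())


def information_gain(x, y):
    """K(X|Y) = H(X) - H(X|Y), using H(X|Y) = H(X,Y) - H(Y)."""
    return entropy(x) - (entropy(list(zip(x, y))) - entropy(y))


def marginal_test_statistic(x, y):
    """Return T = 2*N*K(X|Y) and the (r-1)(c-1) degrees of freedom."""
    t = 2 * len(x) * information_gain(x, y)
    dof = (len(set(x)) - 1) * (len(set(y)) - 1)
    return t, dof
```

The returned T is then compared with the chi-square critical value for `dof` degrees 
of freedom at the chosen significance level; looking up that critical value is left 
to the caller. 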

A test for conditional independence of X and Y given Z, as proposed by Kullback, 
is based on the information gain on X given by Y once Z is known. The hypotheses to 
be tested are: 



$H_0: I(X|Z,Y) = 0 \quad (X \perp Y \mid Z)$                                   (2) 

$H_1: I(X|Z,Y) > 0 \quad (X \not\perp Y \mid Z)$ 

The statistic T = 2NK(X|Z,Y), where N is the sample size and K(X|Z,Y) is the 
estimator of I(X|Z,Y), is asymptotically distributed as a chi-square variable under 
$H_0$. Under $H_1$, T has a non-central chi-square asymptotic distribution. 
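
The conditional statistic can be sketched in the same way. Again this is an 
illustration with hypothetical function names, not the authors' code. The degrees of 
freedom (r-1)(c-1)|Z| are the conventional choice for a stratified two-way table and 
are an assumption here; a conditioning set with several variables can be passed by 
making each element of `z` a tuple. 

```python
from collections import Counter
from math import log


def entropy(values):
    """Empirical Shannon entropy (in nats) of a sequence of discrete values."""
    n = len(values)
    return -sum(c / n * log(c / n) for c in Counter(values).values())


def conditional_information(x, y, z):
    """K(X|Z,Y) = H(X|Z) - H(X|Z,Y), written with joint entropies."""
    h_x_given_z = entropy(list(zip(x, z))) - entropy(z)
    h_x_given_zy = entropy(list(zip(x, z, y))) - entropy(list(zip(z, y)))
    return h_x_given_z - h_x_given_zy


def conditional_test_statistic(x, y, z):
    """Return T = 2*N*K(X|Z,Y) and (r-1)(c-1)|Z| degrees of freedom."""
    t = 2 * len(x) * conditional_information(x, y, z)
    dof = (len(set(x)) - 1) * (len(set(y)) - 1) * len(set(z))
    return t, dof
```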

The proposed algorithm performs tests for conditional independence as just 
described, in an order that depends on measures of information gain, making 
Bonferroni's adjustment and defining a threshold for the information gain as 
described below. 



Bonferroni’s Adjustment. In general, if we have k independent tests, each with level 
of significance $\alpha$, and all null hypotheses are true, the probability of obtaining 
at least one significant result by chance is $1 - (1 - \alpha)^k$. This probability is 
called the global significance level, $\alpha_g$, when many tests of hypothesis are made 
[16] [17]. 

Bonferroni’s Adjustment consists of choosing the global significance level $\alpha_g$ = 
0.05, say, and computing the significance level for each individual test: 



a = 







(3) 



where k is the number of independence tests made by the algorithm and is calculated 
as: 



$k = \sum_{i=1}^{n} \sum_{j=0}^{p} (n - i) + j, \quad \text{with } j < (n - i)$          (4) 

Please note that $\alpha < \alpha_g$, which reduces the risk of overfitting (i.e. of the 
network having more arcs than necessary) when the number of tests is large. 
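
The effect of the adjustment can be checked numerically. The sketch below assumes two 
standard forms for the per-test level: the classical Bonferroni level $\alpha_g / k$ and 
the level that inverts $1 - (1 - \alpha)^k$ exactly (often attributed to Sidak). Which of 
the two Bayes-N uses is not asserted here, and the function names are hypothetical. 

```python
def familywise_error(alpha: float, k: int) -> float:
    """Probability of at least one false rejection in k independent tests."""
    return 1 - (1 - alpha) ** k


def bonferroni_level(alpha_g: float, k: int) -> float:
    """Classical Bonferroni per-test level (assumed form)."""
    return alpha_g / k


def sidak_level(alpha_g: float, k: int) -> float:
    """Per-test level that inverts 1 - (1 - alpha)^k exactly (assumed form)."""
    return 1 - (1 - alpha_g) ** (1 / k)
```

For example, with $\alpha_g$ = 0.05 and k = 100 both levels are close to 0.0005, and 
`familywise_error` applied to either stays at or below 0.05, whereas 100 unadjusted 
tests at 0.05 would give a global level above 0.99. 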

Percentage of Information Gain. Most algorithms for learning Bayesian networks 
based on marginal and conditional independence tests, use a threshold to determine if 
the variables at issue are independent or not. For large databases, small values of the 
information statistic will appear to be significant, even after Bonferroni’s adjustment 
has been performed [16]. Bayes-N includes an additional criterion to decide if two 
variables are independent based on the percentage of information gain, defined by: 

$\%I = \frac{I(X|Y)}{H(X)}$ in the case of testing marginal independence            (5) 

$\%I = \frac{I(X|Z,Y)}{H(X|Z)}$ in the case of testing conditional independence     (6) 



This measure defines an indifference region [16] for deciding what amount of 
information is relevant, specifying an information gain threshold that must be 
exceeded before declaring two variables to be (or not to be) marginally independent 
or conditionally independent. 
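
A minimal sketch of the marginal case of this criterion follows. The function names 
and the `min_percent` parameter are hypothetical, and the ratio is multiplied by 100 
on the assumption that the threshold is expressed as a percentage. 

```python
from collections import Counter
from math import log


def entropy(values):
    """Empirical Shannon entropy (in nats) of a sequence of discrete values."""
    n = len(values)
    return -sum(c / n * log(c / n) for c in Counter(values).values())


def percent_information_gain(x, y):
    """100 * I(X|Y) / H(X); defined as 0 when X is constant (H(X) = 0)."""
    h_x = entropy(x)
    if h_x == 0:
        return 0.0
    gain = h_x - (entropy(list(zip(x, y))) - entropy(y))
    return 100 * gain / h_x


def is_relevant(x, y, min_percent):
    """True when Y explains more than min_percent of the entropy of X."""
    return percent_information_gain(x, y) > min_percent
```

In a large sample, a variable whose T statistic is significant but whose share of 
H(X) is, say, below 1% would be treated as independent of X under this criterion. 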



3 The Algorithm 

Bayes-N belongs to a family of algorithms with which we have been working for 
some time [6], [8], [18]. We make the following assumptions: 

a) The variables in the database are discrete. 

b) There are no missing values in any of the cases of the database. 

c) Given the true probabilistic distribution (the true model), the cases occur 
independently. 

d) The number of cases in the database is large enough to make the independence 
and conditional independence tests reliable. 

Basically the algorithm works as follows. Let X be the dependent variable, and 
$Y = \{Y_1, Y_2, \ldots, Y_n\}$ the set of independent variables. Let p be the depth of the 
tests, i.e. the maximum number of variables in the conditional set. Let $\alpha_g$ be the 
global significance as defined previously. The total number of independence tests to 
be performed is k. Let $\alpha$ be the local significance, and let a threshold for the 
information gain be given. %I is the percentage of information of the dependent 
variable provided by the independent variable being tested. During the first 
iteration the algorithm performs tests for marginal independence between the 
dependent variable (X) and the independent variables $(Y_1, Y_2, \ldots, Y_n)$. Next, the 
independent variables are ordered in terms of $I(X|Y_i)$. If the hypothesis of 
independence is rejected then a directed arc is drawn from $Y_{\max}$, the variable with 
greatest $I(X|Y_i)$, to X, and then tests of conditional independence based on 
$I(X|Y_i, Y_{\max})$ are performed to decide if any other arcs are to be drawn from 
some of the $Y_i$ to X. Next we condition on the two variables with greatest 
$I(X|Y_i)$, and repeat the test of conditional independence, applied only to those 
variables connected to X in the previous step, and decide which arcs, if any, are to 
be deleted. The procedure is repeated until we have conditioned on p variables. We 
then make $X = Y_{\max}$, repeat the procedure and iterate until no more variables are 
available. Bonferroni’s adjustment and the information threshold criterion are 
embedded in the statistical tests. 
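
The first pass of the procedure can be sketched as follows. This is a simplified 
illustration under stated assumptions, not the Bayes-N implementation: it covers only 
the marginal tests and the choice of the first parent, the names `first_pass`, 
`critical_value` and `min_percent` are hypothetical, and the chi-square critical 
value for the adjusted significance level is assumed to be supplied by the caller. 

```python
from collections import Counter
from math import log


def entropy(values):
    """Empirical Shannon entropy (in nats) of a sequence of discrete values."""
    n = len(values)
    return -sum(c / n * log(c / n) for c in Counter(values).values())


def gain(x, y):
    """K(X|Y) = H(X) - H(X|Y)."""
    return entropy(x) - (entropy(list(zip(x, y))) - entropy(y))


def first_pass(x, candidates, critical_value, min_percent):
    """Order candidates by K(X|Y_i) and pick the first parent of X, if any.

    candidates maps a variable name to its column of values. The top-ranked
    variable becomes a parent only if T = 2*N*K exceeds critical_value and
    its percentage of information gain exceeds min_percent.
    """
    n, h_x = len(x), entropy(x)
    gains = {name: gain(x, col) for name, col in candidates.items()}
    order = sorted(gains, key=gains.get, reverse=True)
    if not order or h_x == 0:
        return order, None
    top = order[0]
    significant = 2 * n * gains[top] > critical_value
    relevant = 100 * gains[top] / h_x > min_percent
    return order, (top if significant and relevant else None)
```

The later passes would repeat the same pattern with conditional information in place 
of `gain`, conditioning on up to p of the top-ranked variables. 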

Remark. The global significance $\alpha_g$, the depth p, and the minimal percentage of 
information are parameters defined by the user. 



4 Classification 

A classifier is a function that maps a set of instances (attributes) into a specific label 
(class) [19] [4]. In the data mining terminology, it is commonly accepted that the term 
classification refers to the prediction of discrete, nominal or categorical variables’ 
values while the term prediction refers to the prediction of continuous variables’ 
values [20]. 

There exist different methods to test the performance, in terms of accuracy, of any 
classifier such as holdout, cross-validation and bootstrap [19] [20]. In this paper, the 
holdout and cross-validation methods are used to check the accuracy of the model 
built by Bayes-N. 

In the holdout method, the common practice is to randomly partition the data into 
two mutually exclusive samples: the training and the test sets. The test set is 
also called the holdout set. The size of the training set is usually 2/3 of the data 
and the remaining 1/3 of the data corresponds to the test set. The accuracy of 
the model built by the induction algorithm (in this case Bayes-N) is the percentage of 
the test set cases that the model assigns to their correct category. In other words, 
the class to which each case in the test set truly belongs is compared to the 
prediction, made by the model, for that same instance. The reason for partitioning 
the data this way is to avoid overfitting: if the accuracy of the model were estimated 
from the training set alone, anomalous features that are not representative of the 
whole data could be included in this subset and the estimate might not reflect the 
true accuracy. Thus, a test set has to be selected and used to test the robustness of 
the model, in the sense of making correct classifications given noisy data. 

Thus, the overall number of correct classifications divided by the size of the test 
set is the accuracy estimate of the holdout method. Formula 7 [19] shows how to 
calculate the accuracy using the holdout approach. 

$acc_h = \frac{1}{h} \sum_{(v_i, y_i) \in D_h} \delta(I(D_t, v_i), y_i)$                     (7) 

where $I(D_t, v_i)$ denotes the label assigned to instance $v_i$ by the classifier that 
inducer I builds on data set $D_t$ (the training set), $y_i$ is the true label of $v_i$, 
and $D_h$ is the test set; h is the size of the test set. $\delta(i,j) = 1$ if $i = j$ and 0 
otherwise. This means that the loss function used for calculating the accuracy in 
the holdout method is a 0/1 loss function, which considers equal misclassification 
costs. 

The variance of the holdout method is estimated as follows [19]: 

$Var = \frac{acc \times (1 - acc)}{h}$                                             (8) 

where h is the size of the test set. 
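
As a small worked sketch, the holdout accuracy and its variance can be computed from 
the predicted and true labels of the test set. The function names are hypothetical, 
and the predictions are assumed to come from a model trained on the training set only. 

```python
def holdout_accuracy(predicted, actual):
    """Fraction of test cases whose predicted label equals the true label (0/1 loss)."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("predicted and actual must be non-empty and of equal length")
    return sum(p == y for p, y in zip(predicted, actual)) / len(actual)


def holdout_variance(acc, h):
    """Variance estimate acc * (1 - acc) / h for a test set of size h."""
    return acc * (1 - acc) / h


# Four test cases, three classified correctly.
acc = holdout_accuracy(["benign", "malignant", "benign", "benign"],
                       ["benign", "malignant", "malignant", "benign"])
var = holdout_variance(acc, 4)
```

Here `acc` is 0.75 and `var` is 0.75 x 0.25 / 4 = 0.046875, so the standard deviation 
is about 0.22, which shows how wide the uncertainty is for very small test sets. 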

Another common approach, which will also be used here to calculate the accuracy of 
the classifier induced by Bayes-N, is k-fold cross-validation. In the k-fold 
cross-validation method, as described in [19] [20], the complete dataset is randomly 
partitioned into k mutually exclusive subsets (also called the folds) 
$D_1, D_2, \ldots, D_k$ of approximately equal size. The induction algorithm is trained 
and tested k times in the following way: in the first iteration, the algorithm is 
trained on subsets $D_2, \ldots, D_k$ and tested on subset $D_1$; in the second iteration, 
it is trained on subsets $D_1, D_3, \ldots, D_k$ and tested on subset $D_2$; and so on. The 
overall number of correct classifications from the k iterations divided by the size 
of the complete dataset is the accuracy estimate of the k-fold cross-validation 
method. Formula 9 shows how to calculate the accuracy of the cross-validation 
approach. 

$acc_{cv} = \frac{1}{n} \sum_{(v_i, y_i) \in D} \delta(I(D \setminus D_{(i)}, v_i), y_i)$           (9) 

where $I(D \setminus D_{(i)}, v_i)$ denotes the label assigned to instance $v_i$ by the 
classifier that inducer I builds on data set $D \setminus D_{(i)}$, $y_i$ is the true label of 
$v_i$, and $D_{(i)}$ is the fold (test set) that contains $v_i$; n is the size of the 
complete dataset D. $\delta(i,j) = 1$ if $i = j$ and 0 otherwise. As in equation 7, this 
means that the loss function used for calculating the accuracy in the 
cross-validation method is a 0/1 loss function, which considers equal 
misclassification costs. 

Equation 10 shows the formula for the estimation of the variance in this method 
[16]: 

$Var_{cv} = \frac{acc_{cv} \times (1 - acc_{cv})}{n}$                                  (10) 

where n is the size of the complete dataset D. 
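
The k-fold procedure can be sketched as below. This is an illustration only: the 
`train` argument stands for any inducer (a function that takes training cases and 
labels and returns a classifier), and `majority_inducer` is a hypothetical stand-in 
used to keep the example self-contained; it is not Bayes-N. 

```python
import random
from collections import Counter


def k_fold_indices(n, k, seed=0):
    """Randomly partition range(n) into k folds of approximately equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_val_accuracy(cases, labels, k, train, seed=0):
    """Return (acc_cv, Var_cv) using 0/1 loss over all n cases."""
    n = len(cases)
    correct = 0
    for fold in k_fold_indices(n, k, seed):
        held_out = set(fold)
        train_idx = [i for i in range(n) if i not in held_out]
        model = train([cases[i] for i in train_idx], [labels[i] for i in train_idx])
        correct += sum(model(cases[i]) == labels[i] for i in fold)
    acc = correct / n
    return acc, acc * (1 - acc) / n


def majority_inducer(cases, labels):
    """Hypothetical baseline inducer: always predict the most frequent training label."""
    winner = Counter(labels).most_common(1)[0][0]
    return lambda case: winner
```

With ten cases labelled eight "a" and two "b", 5-fold cross-validation of this 
baseline always predicts "a", so the accuracy is 0.8 and the variance estimate is 
0.8 x 0.2 / 10 = 0.016 whatever the random partition. 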

It is very important to stress the advantage of Bayes-N over some search and 
scoring algorithms regarding performance in classification tasks. In the case of 
Bayes-N, the Bayesian network is built by adding or deleting arcs (as the case may 
be) taking into account only a pair of nodes and the nodes in the conditional 
set; i.e., the rest of the nodes are not considered in the analysis. This is why the 
information gain measures used in Bayes-N are called local. In other words, the 
construction of the network does not depend on a global scoring function, such as 
MDL (minimum description length) or BIC (Bayesian information criterion), that 
evaluates the entire network every time an operator (such as adding, deleting or 
reversing arcs) is applied. In cases where there are many attributes, 
these global scoring functions fail to minimize local errors that have to do with the 
classification performance of the resultant network. That is to say, although such a 
network produces a good MDL score, it may perform poorly as a classifier [4]. 



5 Results 

Three different databases were used to test the performance of five different 
algorithms. The first one is called ALARM. ALARM stands for “A Logical Alarm 
Reduction Mechanism”. This database was constructed from a network that was built 
by Beinlich [9], [21] as an initial research prototype to model potential anaesthesia 
problems in the operating room. ALARM has 37 variables (nodes) and 46 arcs. From 
these 37 variables, 8 variables represent diagnostic problems, 16 variables represent 
findings and 13 variables represent intermediate variables that connect diagnostic 
problems to findings. Each node (variable) has from 2 to 4 different possible values. 
The size of the sample of the ALARM database is 10,000 cases. The class node is 
Blood Pressure (Node 5). 

The second database is a real-world database that comes from the field of 
pathology and has to do with the cytodiagnosis of breast cancer using a technique 
called fine needle aspiration of the breast lesion (FNAB) [22], [23] (Stephenson et al. 
2000), which is the most common confirmatory method used in the United Kingdom 
for this purpose [23]. It contains 692 consecutive specimens of FNAB received at the 
Department of Pathology, Royal Hallamshire Hospital in Sheffield during 1992-1993 
[23]. The dataset comprises 11 independent variables and 1 dependent variable. 
The independent variables are: age, cellular dyshesion, intracytoplasmic lumina, 
“three-dimensionality” of epithelial cell clusters, bipolar “naked” nuclei, foamy 
macrophages, nucleoli, nuclear pleomorphism, nuclear size, necrotic epithelial cells 
and apocrine change. All these variables, except age, are dichotomous, taking the 
values “true” or “false” to indicate the presence or absence of a diagnostic feature. 
The variable age was sorted into three categories: 1 (up to 50 years old), 
2 (51 to 70 years old) and 3 (above 70 years old). The dependent variable “outcome” 
can take on two different values: benign or malignant. In the case of a malignant 
outcome, such a result was confirmed by a biopsy (where available). 

The third database is called ASIA [24]. ASIA has 8 variables and 8 arcs. This 
database comes from a very small Bayesian network for a fictitious medical example 
about whether a patient has tuberculosis, lung cancer or bronchitis, related to their X- 
ray, dyspnoea, visit-to-Asia and smoking status; it is also called "Chest Clinic". Each 
node (variable) has 2 different possible values. The probability distributions for each 
node are described at the Norsys Software Corporation web site[24]. The size of the 
sample of the ASIA database is 1,000 cases. 

Table 1 shows the classification performance, using the holdout method [16], of 
five different algorithms: Naive Bayes [4], Bayes-N, K2 [9], PC [7] and Bayes9 [8]. 
Table 2 shows the classification performance, using the 5-fold cross-validation 
method, of the same five algorithms. Accuracy and standard deviation are 
shown in these tables. Table 3 shows the MDL for the networks as used for the 
holdout method. 



Table 1. Classification performance for the holdout method of Naive Bayes, Bayes-N, K2, 
Tetrad and Bayes9. 



Data set   Naive Bayes   Bayes-N    K2         Tetrad (PC)   Bayes9 
Alarm      62.65 ±       82.57 ±    82.57 ±    82.57 ±       82.57 ± 
Cancer     92.80 ±       95.34 ±    95.34 ±    92.80 ±       95.34 ± 
Asia       93.53 ±       96.18 ±    96.18 ±    95.01 ±       96.18 ± 


Table 2. Classification performance for the 5-fold cross-validation method of Naive Bayes, 
Bayes-N, K2, Tetrad and Bayes9 



Data set   Naive Bayes   Bayes-N    K2         Tetrad (PC)   Bayes9 
Alarm      61.52 ±       82.25 ±    82.25 ±    82.13 ±       82.79 ± 0.38 
Cancer     93.00 ±       94.65 ±    94.51 ±    94.36 ±       94.36 ± 
Asia       89.63 ±       94.30 ±    94.20 ±    95.80 ±       95.80 ± 


Table 3. MDL scores for the networks as used for the holdout method 



Data set   Naive Bayes   Bayes-N     K2          Tetrad (PC)   Bayes9 
Alarm      143730.14     78010.91    71590.66    78845.10      79352.5 
Cancer     2685.40       2679.55     2648.20     2746.72       2759.91 
Asia       2387.20       2213.34     2215.23     2211.79       2211.34 

6 Discussion 

As can be seen from Tables 1 and 2, the classification performance of Bayes-N is 
much better than that of the naive Bayesian classifier and comparable to that of 
K2, Tetrad and Bayes9. The advantage of Bayes-N over naive Bayes, besides 
accuracy, is that it performs a correct feature subset selection, removing from the 
analysis variables that are not significant. Also, Bayes-N does not make the 
strong assumption, made by naive Bayes, that the attribute variables are conditionally 
independent given the class variable. That is to say, Bayes-N considers that there may 
well be interactions among attributes, which in turn can give more richness in the 
modelling and understanding of the phenomenon under investigation. In the case of 
the breast cancer dataset, the pathologists use the 11 independent variables in order 
to decide whether the patient has cancer or not (class variable), achieving a high 
overall classification performance [22]. Our initial hypothesis was that the naive 
Bayesian classifier, which by definition takes into account all these variables to 
make the final diagnosis, would produce much better results than those produced by 
the algorithms that only take a subset of these variables, such as Bayes-N. However, 
as the results show, this was surprisingly not the case; i.e., Bayes-N outperforms 
naive Bayes even though the former uses only a subset of the attributes used by the 
latter. This suggests that the local information gain measures used by Bayes-N 
represent a robust and accurate approach when building Bayesian network classifiers. 
Thus, with the reduction of redundant attributes, these measures also lead to the 
construction of parsimonious models. 

Compared to K2, a very notable feature of Bayes-N is that it does not need an 
ancestral ordering to build the network, whereas K2 does. In fact, Bayes-N 
induces this ancestral ordering using local information gain measures. In the case of 
K2, an ancestral ordering must be externally provided and the resultant network 
constructed by such a procedure is highly sensitive to this order; i.e., if the optimal 
ancestral ordering is not the one provided to K2, then the resultant network might be 
extremely inaccurate [9], [25]. 

Compared with Tetrad and Bayes9, Bayes-N gives direction to all the arcs in the 
network whereas the other two procedures may produce undirected arcs. Although 
this problem can be alleviated more or less easily in some problems, it still needs a 
certain amount of knowledge elicitation in order to direct these arcs. Even in large 
domains, as in the case of ALARM, Bayes-N seems to induce correct ancestral 
orderings, which allow all the arcs among variables to be directed. 

Some limitations of Bayes-N can be mentioned. First, for Bonferroni’s 
adjustment, it is necessary to calculate, a priori, the number of independence or 
conditional independence tests so that this adjustment can be included in 
the procedure. Second, the percentage of significant gain is a parameter that has 
to be manually tuned. Unfortunately, as usually happens in the definition of a 
threshold, certain values for this parameter can lead to inaccurate results. 

As future work, we want to look for criteria that may allow for the automatic 
assignment of the percentage of significant information gain. 

Abstract. This paper proposes a methodology for handling uncertainty in 
non-linear systems by using type-2 Fuzzy Logic (FL). 
This methodology works under a training scheme from numerical data, 
using type-2 Fuzzy Logic Systems (FLS). Different training methods can 
be applied while working with it, as well as different training approaches. 
One of the training methods used here is also a proposal: the One-Pass 
method for interval type-2 FLS. We carried out several experiments 
forecasting a chaotic time-series with additive noise and obtained 
better performance with interval type-2 FLSs than with conventional 
ones. In addition, we used the designed FLSs to forecast the time-series 
with different initial conditions, and this did not affect their performance. 



1 Introduction 

In this paper, we propose a general methodology for designing fuzzy logic systems 
(FLS) based on input-output pairs. The proposed methodology provides a way 
to handle uncertainty through the use of type-2 FLSs. Next, we will give a brief 
introduction to type-2 fuzzy logic (FL) and its fuzzy sets. 

Frequently, the knowledge that is used to construct the rules in a FLS is 
uncertain. This uncertainty leads to rules with uncertain antecedents and/or 
consequents, which in turn translates into uncertain antecedent and/or consequent 
membership functions (mf). The main sources of such uncertainties are: (1) 
The meaning of the words that are used in the antecedents and consequents 
of the rules can be uncertain (words mean different things to different 
people). (2) When we survey a group of experts, we might get different 
consequents for the same rule, because the experts do not necessarily agree. (3) 
Noisy measurements may be used to activate FLSs. (4) The data that are 
used to tune the parameters of a FLS may also be noisy. All of these sources of 
uncertainty translate into uncertainties about the mfs of fuzzy sets. Ordinary fuzzy 
sets (henceforth type-1 fuzzy sets) are not able to directly model such uncertainties 
because their mfs are totally crisp, whereas type-2 fuzzy sets are able to model 
those uncertainties because their mfs are themselves fuzzy. In our experiments 
we had cases (3) and (4) (due to noisy data for training and testing), so we used 
type-2 fuzzy sets to correctly handle these uncertainties. 

Fig. 1. FOU for Gaussian primary membership function with uncertain mean 

Fig. 2. Input and antecedent operations for an interval singleton type-2 FLS using 
product t-norm 

General type-2 fuzzy sets are three-dimensional and the amplitude of their 
secondary mfs (called the secondary grade) can be in [0, 1]. When the domain 
of a secondary mf (called the primary membership) bounds a region (called 
the footprint of uncertainty, FOU) whose secondary grades all equal one, the 
resulting type-2 fuzzy set is called an interval type-2 fuzzy set, which can easily 
be depicted in two dimensions instead of three; see Fig. 1, where the solid line 
denotes the upper mf, and the dashed line denotes the lower mf. Moreover, since 
working with general type-2 fuzzy sets is computationally very intensive, we only 
used interval type-2 fuzzy sets, because they are less intensive and they can still 
handle noisy data by making use of an interval of uncertainty. Systems using 
interval type-2 fuzzy sets are called interval type-2 FLSs [1], and they were used 
for illustrating the proposed methodology. We also used type-1 FLSs in order to 
have a reference of performance. 
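
The upper and lower mfs that bound such a footprint of uncertainty can be sketched 
for a Gaussian primary mf whose mean is only known to lie in an interval. The 
parameter names `m1`, `m2` and `sigma` and the function names are assumptions made 
for this example; the piecewise form is the one commonly used for this kind of 
interval type-2 set and is not taken from the text above. 

```python
from math import exp


def gaussian(x, mean, sigma):
    """Gaussian membership grade with peak value 1 at x = mean."""
    return exp(-0.5 * ((x - mean) / sigma) ** 2)


def upper_mf(x, m1, m2, sigma):
    """Upper bound of the FOU for a mean in [m1, m2]: flat at 1 between the means."""
    if x < m1:
        return gaussian(x, m1, sigma)
    if x > m2:
        return gaussian(x, m2, sigma)
    return 1.0


def lower_mf(x, m1, m2, sigma):
    """Lower bound of the FOU: the smaller of the two extreme Gaussians."""
    if x <= (m1 + m2) / 2:
        return gaussian(x, m2, sigma)
    return gaussian(x, m1, sigma)
```

For any input x the pair `(lower_mf(x, ...), upper_mf(x, ...))` is the interval of 
membership grades, which is all an interval type-2 FLS needs from an antecedent. 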

2 FLSs with Different Fuzzifiers 

According to the type of fuzzification [2] , FLSs can be divided into singleton and 
non-singleton, which are described below. 

2.1 Singleton FLS 

Fig. 3. Consequent operations for an interval type-2 FLS, with its fired output
sets using product t-norm

Fig. 4. Input and antecedent operations for an interval non-singleton type-2 FLS
using product t-norm

In a singleton type-1 FLS, the fuzzifier maps a crisp point into a fuzzy singleton,
which is a set that has only one element with unit membership, i.e.,
$\mu_A(x) = 1$ for $x = x'$ and $\mu_A(x) = 0$ for $x \neq x'$. Thus, a singleton
type-1 FLS can directly map the crisp inputs (i.e., the fuzzy singletons) into the
membership values of the antecedents. With these membership values, the system
computes a firing level by using a t-norm (minimum or product). Then, it applies a
composition (max-min or max-product) to the firing level of each rule and its
corresponding consequent (note that the firing level is a crisp value). Later, it
combines the weighted consequents through the s-norm (max) in order to obtain an
output fuzzy set, which is defuzzified to compute a crisp value that will be the
system output.
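The singleton type-1 inference described above can be illustrated with a minimal
sketch. It assumes Gaussian mfs, the product t-norm, and a center-average
defuzzifier (the configuration used later in the experiments) rather than the
general max-composition followed by an arbitrary defuzzifier. The function names
and the two example rules are hypothetical.

```python
import math


def gaussian_mf(x, mean, sigma):
    """Membership grade of x in a Gaussian fuzzy set."""
    return math.exp(-0.5 * ((x - mean) / sigma) ** 2)


def singleton_t1_fls(x, rules):
    """Evaluate a singleton type-1 FLS (assumed: product t-norm, center average).

    rules is a list of (antecedents, y_center) tuples, where antecedents is a
    list of (mean, sigma) pairs, one per input.
    """
    num = 0.0
    den = 0.0
    for antecedents, y_center in rules:
        firing = 1.0
        for xk, (mean, sigma) in zip(x, antecedents):
            # Singleton fuzzification: the crisp input is plugged directly
            # into the antecedent mf; the product acts as the t-norm.
            firing *= gaussian_mf(xk, mean, sigma)
        num += firing * y_center
        den += firing
    return num / den if den > 0.0 else 0.0


# Two hypothetical rules over two inputs (illustrative values only).
RULES = [
    ([(0.0, 1.0), (0.0, 1.0)], 0.2),
    ([(1.0, 1.0), (1.0, 1.0)], 0.8),
]
```

With these rules, an input halfway between the two rule centers fires both rules
equally, so the output is the plain average of the two consequent centers.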

In an interval singleton type-2 FLS [1], the process of fuzzy inference can be
briefly described as follows. First, the fuzzification process is accomplished as
shown in Fig. 2, which depicts input and antecedent operations for a
two-antecedent single-consequent rule, singleton fuzzification, and, in this case,
product t-norm. It is worth mentioning that, regardless of the t-norm, the firing
strength is an interval type-1 set $F^l$, represented by its lower and upper mfs
as $[\underline{f}^l, \overline{f}^l]$.

Figure 3 depicts the weighted consequents for a two-rule ($l = 1, 2$) singleton
type-2 FLS, where $\overline{f}^l$ is t-normed with the upper mf
$\overline{\mu}_{\tilde{G}^l}$ ($\tilde{G}^l$ is the consequent of the rule
$R^l$), and $\underline{f}^l$ is t-normed with the lower mf
$\underline{\mu}_{\tilde{G}^l}$. The primary membership of $\mu_{\tilde{B}^l}(y)$
$\forall y \in Y$ [i.e., the FOU($\tilde{B}^l$)] is the darkened area.

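Figure 5 depicts the combined type-2 output set for the two-rule singleton type-2
FLS, where the fired output sets are combined using the maximum t-conorm. The
upper solid curve corresponds to
$[\overline{f}^1 * \overline{\mu}_{\tilde{G}^1}(y)] \vee [\overline{f}^2 * \overline{\mu}_{\tilde{G}^2}(y)]$
for $\forall y \in Y$, and the lower dashed curve corresponds to
$[\underline{f}^1 * \underline{\mu}_{\tilde{G}^1}(y)] \vee [\underline{f}^2 * \underline{\mu}_{\tilde{G}^2}(y)]$
for $\forall y \in Y$. The primary membership of $\mu_{\tilde{B}}(y)$
$\forall y \in Y$ [i.e., the FOU($\tilde{B}$)] is the darkened area between these
two functions, and it is also an interval set.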

The next step, after fuzzy inference, is type-reduction. Regardless of the
type-reduction method we choose, and because we are now dealing with interval
sets, the type-reduced set is also an interval set, and it has the structure
$Y_{TR} = [y_l, y_r]$. We defuzzify it using the average of $y_l$ and $y_r$; hence,
the defuzzified output of an interval singleton type-2 FLS is
$y(\mathbf{x}) = [y_l + y_r]/2$.

Fig. 5. Combined output sets for the two fired output sets shown in Fig. 3

Fig. 6. Block diagram of the methodology proposed for designing FLS
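As an illustration of the firing interval and of the final defuzzification in an
interval singleton type-2 FLS, the following is a minimal sketch. It assumes
Gaussian primary mfs with an uncertain mean in $[m_1, m_2]$ and a fixed standard
deviation, combined with the product t-norm; the upper and lower mfs follow the
usual piecewise definition for that kind of set. Type-reduction itself is not
implemented here, and all function names are hypothetical.

```python
import math


def gaussian(x, mean, sigma):
    return math.exp(-0.5 * ((x - mean) / sigma) ** 2)


def upper_mf(x, m1, m2, sigma):
    """Upper mf of a Gaussian primary mf with uncertain mean in [m1, m2]."""
    if x < m1:
        return gaussian(x, m1, sigma)
    if x > m2:
        return gaussian(x, m2, sigma)
    return 1.0


def lower_mf(x, m1, m2, sigma):
    """Lower mf: the smaller of the two extreme Gaussians at x."""
    if x <= (m1 + m2) / 2.0:
        return gaussian(x, m2, sigma)
    return gaussian(x, m1, sigma)


def firing_interval(x, antecedents):
    """Return (f_lower, f_upper) for one rule using the product t-norm.

    antecedents is a list of (m1, m2, sigma) triples, one per input.
    """
    f_lower, f_upper = 1.0, 1.0
    for xk, (m1, m2, sigma) in zip(x, antecedents):
        f_lower *= lower_mf(xk, m1, m2, sigma)
        f_upper *= upper_mf(xk, m1, m2, sigma)
    return f_lower, f_upper


def defuzzify(y_l, y_r):
    """Defuzzify the type-reduced interval [y_l, y_r] by its midpoint."""
    return (y_l + y_r) / 2.0
```

When the interval of uncertainty collapses ($m_1 = m_2$), the lower and upper mfs
coincide and the firing interval reduces to the crisp firing level of a type-1
FLS.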



2.2 Non-singleton FLS 

A non-singleton FLS has the same structure as a singleton FLS (fuzzifier +
inference engine + defuzzifier), and they share the same type of rules; the
difference lies in the fuzzification. In a non-singleton FLS, the fuzzifier treats
the inputs as fuzzy numbers; in other words, a non-singleton FLS models its inputs
as fuzzy sets by associating mfs with them.

Conceptually, the non-singleton fuzzifier implies that the given input value $x'$
is the most likely value to be the correct one among all the values in its
immediate neighborhood; however, because the input is corrupted by noise,
neighboring points are also likely to be the correct value, but to a lesser degree.

Within the fuzzification process in a non-singleton type-1 FLS, the crisp
inputs to the system $x_k$ establish the centers of the mfs of the input fuzzy
sets, i.e., they are modeled as fuzzy numbers. These fuzzy numbers are used to
compute the firing level of each rule by using a t-norm (minimum or product).
Here, as in a singleton type-1 FLS, the firing level of a rule is again a crisp
value, so the process of inference and defuzzification described for that system
can also be applied to a non-singleton type-1 FLS.

In an interval non-singleton type-2 FLS [1], the process of fuzzy inference
can be briefly described as follows. First, as part of the process to obtain the
output from the system, it is necessary to determine the meet operation between
an input type-1 set and an antecedent type-2 set. This task just involves the
t-norm operation between the input mf and the lower and upper mfs of the
antecedent. The result is an interval set, depicted by the thick (solid and
dashed) lines in Fig. 4, where we show input and antecedent operations for a
two-antecedent single-consequent rule, non-singleton fuzzification, and product
t-norm. Regardless of the t-norm used, the firing strength is an interval type-1
set $[\underline{f}^l, \overline{f}^l]$, where
$\underline{f}^l = \underline{f}_1^l * \underline{f}_2^l$ and
$\overline{f}^l = \overline{f}_1^l * \overline{f}_2^l$. Observe that
$\underline{f}_k^l$ is the supremum of the firing strength between the t-norm of
$\mu_{X_k}(x_k)$ and the lower mf $\underline{\mu}_{\tilde{F}_k^l}(x_k)$, and that
$\overline{f}_k^l$ is the supremum of the firing strength between the t-norm of
$\mu_{X_k}(x_k)$ and the upper mf $\overline{\mu}_{\tilde{F}_k^l}(x_k)$
($k = 1, 2$). Note that $\mu_{X_k}(x_k)$ is centered at $x_k = x_k'$.
These t-norms are shown as heavy curves in Fig. 4. From these heavy curves it
is easy to pick off their suprema.

As we can see, the result of input and antecedent operations is again an
interval (the firing interval), as in Fig. 2. Consequently, the process of
inference and defuzzification described for the singleton type-2 FLS can also be
applied to a non-singleton type-2 FLS, i.e., the results depicted in Fig. 3 and
Fig. 5 remain the same for an interval non-singleton type-2 FLS.
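The supremum operation used in non-singleton fuzzification can be made concrete
with a small sketch for the simplest case: a Gaussian input mf centered at the
measured value and a type-1 Gaussian antecedent mf, combined with the product
t-norm. Under these assumptions the supremum has a closed form, which the code
checks against a brute-force grid search. In the interval type-2 case the same
supremum would be taken separately against the lower and the upper antecedent
mfs. The function names and the numbers in the usage note are hypothetical.

```python
import math


def gaussian(x, mean, sigma):
    return math.exp(-0.5 * ((x - mean) / sigma) ** 2)


def sup_product_closed_form(x_in, sigma_in, mean_f, sigma_f):
    """Supremum over x of gaussian(x; x_in, sigma_in) * gaussian(x; mean_f, sigma_f).

    Returns (x_max, value): where the supremum is attained and its value.
    """
    total_var = sigma_in ** 2 + sigma_f ** 2
    x_max = (sigma_in ** 2 * mean_f + sigma_f ** 2 * x_in) / total_var
    value = math.exp(-0.5 * (x_in - mean_f) ** 2 / total_var)
    return x_max, value


def sup_product_grid(x_in, sigma_in, mean_f, sigma_f, lo, hi, steps=20001):
    """Brute-force estimate of the same supremum on a uniform grid."""
    best = 0.0
    for i in range(steps):
        x = lo + (hi - lo) * i / (steps - 1)
        best = max(best, gaussian(x, x_in, sigma_in) * gaussian(x, mean_f, sigma_f))
    return best
```

For example, with a measured input of 0.5 (input spread 0.1) and an antecedent
centered at 0.8 (spread 0.2), the supremum is attained between the two centers,
closer to the narrower input mf. When the input spread tends to zero, the result
tends to the singleton firing level.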



3 Proposed Methodology 

The proposed methodology is oriented to modeling non-linear systems by learning
from numerical data, not from human knowledge. In other words, it is specifically
for designing FLSs from input-output pairs. Note that, as the methodology is for
handling uncertainty in non-linear systems, it focuses mostly on type-2 FLSs
(since they can model the noise from the training set by means of interval type-2
mfs); however, it can also be applied with type-1 FLSs, although their performance
is not quite as good as when there are no sources of uncertainty. Figure 6 shows
the block diagram of this methodology. The steps are described below:

1. Obtaining data. First, obtain the input-output pairs from which the FLS is
going to be trained. Usually these data pairs are given in advance (e.g., from
historical data of a process), but if they are not provided, it is necessary to
obtain them by doing simulations (as in our experiments, where we forecasted a
time series).

2. Defining the training and testing sets. The data (obtained in the previous
step) should be divided into two groups in order to form the training set and the
testing set.

3. Setting the number of antecedents per rule. The number of antecedents that
each rule will have is equal to the number of inputs to the FLS.

4. Defining the type of FLS. Here, specify the characteristics of the system that
will be used, such as the type of the fuzzifier (singleton, non-singleton), the
shapes of the mfs (triangular, trapezoidal, Gaussian), the type of composition
(max-min, max-product), the type of implication (minimum, product), and the type
of defuzzifier (centroid, center average).

5. Training methods. The training of the system and the elaboration of its rules
depend on the training method used; see Section 4.

6. Applying the FLS. Once the system has been trained with the training set, it
can be used for obtaining the outputs of the FLS, now using the testing set. The
output of the trained system and the desired output are used to compute the
system performance.
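Steps 2 and 3 can be sketched for a time-series forecaster as follows. This is
only an illustration: the sliding-window construction and the reuse of the last
p training samples as context for the first test pairs are assumptions, and the
function names are hypothetical.

```python
def make_pairs(series, p):
    """Build (inputs, target) pairs: p past values forecast the next value."""
    pairs = []
    for k in range(p - 1, len(series) - 1):
        inputs = series[k - p + 1 : k + 1]  # x(k-p+1), ..., x(k)
        target = series[k + 1]              # x(k+1)
        pairs.append((inputs, target))
    return pairs


def train_test_pairs(series, n_train, p):
    """Split a series into training and testing pairs.

    The first n_train samples form the training set. The test pairs reuse the
    last p training samples as context (an assumption of this sketch), so that
    every sample of the testing set appears once as a target.
    """
    train_pairs = make_pairs(series[:n_train], p)
    test_pairs = make_pairs(series[n_train - p :], p)
    return train_pairs, test_pairs
```

With 1000 samples, the first 504 used for training, and p = 4, this yields 500
training pairs and 496 test pairs.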

Note that when working with FLSs of the same type (i.e., systems with the same
definition in step 4) but with different training methods (e.g., one-pass method
or gradient descent method), the systems share the first four steps, which allows
us to make a more reliable comparison.

4 Training Methods

The methods used here for training the FLSs were the one-pass method and the 
gradient descent method (also called steepest descent method). 



4.1 One-Pass Method 

In this method the system parameters are not tuned, because once the rules
have been obtained from the training set, they will not change at all. In fact,
the data directly establish the centers of the fuzzy sets in both the antecedents
and the consequent of each rule (typically, the remaining parameters of the mfs
are preset by the designer). Hence, the number of rules is equal to the number of
training data pairs N.

The procedure for training a type-2 FLS with the one-pass method is very
similar to the one described above (which is for a type-1 FLS); but now, the mfs
used in the antecedents and consequents are interval type-2 fuzzy sets, instead
of the conventional ones. In this case, the training set is employed to establish
the interval of uncertainty or, more specifically, to define the left and right
means of each mf; see Fig. 1. Since the system is still trained directly from the
N noisy input-output pairs, the total number of rules will be N, as in the type-1
FLSs.

Note that, when training a system with this method, the type of fuzzifier is
irrelevant, because the method only focuses on establishing the parameters of the
rules, whereas the mf parameters modeling the inputs (which are not considered
part of the rules) are simply preset by the designer. In other words, the rules
produced in a singleton FLS are exactly the same as those in a non-singleton FLS,
as long as both are type-1 FLSs or both are type-2 FLSs.
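A minimal sketch of the one-pass method for a type-1 FLS is shown below. It
assumes Gaussian antecedent mfs with a spread preset by the designer, the product
t-norm, and a center-average defuzzifier; each training pair becomes one rule
whose centers are taken directly from the data. Function names and data are
hypothetical.

```python
import math


def one_pass_rules(pairs, sigma):
    """Create one rule per training pair; the data set the mf centers directly.

    Each rule is (antecedents, y_center), where antecedents is a list of
    (mean, sigma) pairs. The spread sigma is preset, not learned.
    """
    return [([(xk, sigma) for xk in inputs], target) for inputs, target in pairs]


def predict(x, rules):
    """Singleton type-1 inference with product t-norm and center average."""
    num = 0.0
    den = 0.0
    for antecedents, y_center in rules:
        firing = 1.0
        for xk, (mean, s) in zip(x, antecedents):
            firing *= math.exp(-0.5 * ((xk - mean) / s) ** 2)
        num += firing * y_center
        den += firing
    return num / den if den > 0.0 else 0.0
```

Because nothing is tuned, the rule base grows linearly with the training set,
and every forecast has to evaluate all N rules.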



4.2 Gradient Descent Method 

Unlike the one-pass method, this method does tune the system parameters, by
using a gradient descent algorithm and N training data pairs. Its objective is
to minimize the error function
$e^{(i)} = \frac{1}{2}[f(\mathbf{x}^{(i)}) - y^{(i)}]^2$, $i = 1, \ldots, N$, where
$f(\mathbf{x}^{(i)})$ is the value obtained by training the system with the
input-output pair $(\mathbf{x}^{(i)} : y^{(i)})$. In this case, all the parameters
of the rules will be tuned according to this error [2]. These parameters should
be initialized with values that have a physical meaning (as suggested in [1], see
Table 1) so that the algorithm converges faster. The parameters are tuned as the
epochs increase; here, an epoch is defined as the adjustment made to the
parameters in a cycle that covers the N training pairs only once. The algorithm
stops when a predefined number of epochs has been reached or when the error has
become smaller than an arbitrary value.

Typically, when using the gradient descent method, the parameters to be
tuned depend on the type of fuzzifier employed. For example, when using a
singleton fuzzifier, the parameters to be tuned are only those concerning the
mfs of the antecedents and consequents of the rules, whereas when using a
non-singleton fuzzifier the parameters to be tuned also include those concerning
the mfs that model the inputs (typically the spread).

Besides, with this method we can apply different training approaches (or
tuning approaches). The dependent approach uses parameters from another FLS
already designed (with the best performance) to update a new FLS (for example,
using a singleton FLS to design a non-singleton one). This is done by keeping all
of the parameters that are shared by both FLSs fixed at the values already
adjusted, and tuning only the remaining parameter(s) of the new system. On the
other hand, in the independent approach all the parameters of the new FLS are
tuned. If an FLS has already been designed, then we can use its parameters as
initial values or seeds; we call this case the partially independent approach,
and the one with no seeds, the totally independent approach.
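The tuning loop can be sketched for a singleton type-1 FLS as follows. To keep
the example short, only the consequent centers are updated; the full method also
tunes the antecedent parameters (and the input spread in the non-singleton case).
The sketch assumes Gaussian antecedents, the product t-norm, and a center-average
defuzzifier, so that the gradient of $\frac{1}{2}[f(\mathbf{x}) - y]^2$ with
respect to a consequent center is the output error times the normalized firing
level of that rule. The names and the learning rate `alpha` are hypothetical.

```python
import math


def firing_levels(x, antecedents):
    """Firing level of every rule (Gaussian mfs, product t-norm)."""
    levels = []
    for rule in antecedents:
        f = 1.0
        for xk, (mean, sigma) in zip(x, rule):
            f *= math.exp(-0.5 * ((xk - mean) / sigma) ** 2)
        levels.append(f)
    return levels


def fls_output(x, antecedents, y_centers):
    """Center-average output; assumes at least one rule fires (total > 0)."""
    levels = firing_levels(x, antecedents)
    total = sum(levels)
    return sum(f * y for f, y in zip(levels, y_centers)) / total


def train_epoch(pairs, antecedents, y_centers, alpha):
    """One epoch: a single pass over the N pairs, updating after each pair."""
    for x, target in pairs:
        levels = firing_levels(x, antecedents)
        total = sum(levels)
        out = sum(f * y for f, y in zip(levels, y_centers)) / total
        err = out - target
        for i, f in enumerate(levels):
            # d(0.5 * err^2) / d(y_center_i) = err * f_i / sum(f)
            y_centers[i] -= alpha * err * f / total
    return y_centers
```

In the dependent approach described above, the same loop would be run with the
shared parameters held fixed; in the partially independent approach, the tuned
values of a previous FLS would be used as the initial `y_centers`.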



5 Experiments 



In order to illustrate the methodology proposed in Section 3, we did some
experiments forecasting the Mackey-Glass time series [4], represented as:

$$\frac{ds(t)}{dt} = \frac{0.2\,s(t - \tau)}{1 + s^{10}(t - \tau)} - 0.1\,s(t) \qquad (1)$$

For $\tau > 17$, (1) exhibits chaos. In our simulations we used $\tau = 30$. Next,
we will give a short description of each step.



1. Obtaining data. In order to generate the data for the training and testing
sets, (1) was transformed into a discrete-time equation, and we applied Euler's
approximation method with a step size equal to 1. Then, we corrupted those data
with 0 dB additive noise. We generated 50 sets of this uniformly distributed
noise $n(k)$ and added them to the noise-free time series $s(k)$; so, we got 50
noisy data sets $x(k) = s(k) + n(k)$, where $k = 1001, 1002, \ldots, 2000$.

2. Defining the training and testing sets. We formed the training and testing
sets from the 1000 data obtained in the previous step. The training set was
formed by the first 504 data, i.e., $x(1001), x(1002), \ldots, x(1504)$. The
testing set was formed by the remaining 496 data.

3. Setting the number of antecedents per rule. The number of antecedents per
rule that we used for forecasting the time series was $p = 4$, i.e.,
$x(k - 3), x(k - 2), x(k - 1), x(k)$, from which we obtained $x(k + 1)$.
Note that we only designed single-stage forecasters (with four-antecedent
single-consequent rules), since, even though the system error may be very
small, iterative forecasts might increase that error in every iteration, up to
undesired values [5], [6].

4. Defining the type of FLS. As we did several experiments, there are different
definitions of FLSs, but all of them used product implication and max-product
composition. Type-1 FLSs used type-1 Gaussian mfs for the rule fuzzy sets, a
center-average defuzzifier, and either a singleton fuzzifier or a Gaussian
non-singleton fuzzifier. Type-2 FLSs used Gaussian primary mfs with uncertain
mean for the rule fuzzy sets, an interval weighted average type-reducer, and
either a singleton fuzzifier or a Gaussian non-singleton fuzzifier (modeling the
inputs as type-1 Gaussian mfs because the additive noise was stationary).

Table 1. Initial values of the parameters using the gradient descent method. Each
antecedent is described by two fuzzy sets ($i = 1, \ldots, M$ and $k = 1, \ldots, p$)

FLS      Input                  For each antecedent                                                       Consequent
ST1FLS   N/A                    mean: $m_x - 2\sigma_x$ or $m_x + 2\sigma_x$; $\sigma_{F_k^i}$              $\bar{y}^i \in [0, 1]$
NST1FLS  $\sigma_X = \sigma_n$  mean: $m_x - 2\sigma_x$ or $m_x + 2\sigma_x$; $\sigma_{F_k^i}$              $\bar{y}^i \in [0, 1]$
ST2FLS   N/A                    mean: $[m_x - 2\sigma_x - 0.25\sigma_n, m_x - 2\sigma_x + 0.25\sigma_n]$    $y_l^i = \bar{y}^i - \sigma_n$
                                or $[m_x + 2\sigma_x - 0.25\sigma_n, m_x + 2\sigma_x + 0.25\sigma_n]$;      $y_r^i = \bar{y}^i + \sigma_n$
                                $\sigma_k^i = 2\sigma_x$
NST2FLS  $\sigma_X = \sigma_n$  mean: $[m_x - 2\sigma_x - 0.25\sigma_n, m_x - 2\sigma_x + 0.25\sigma_n]$    $y_l^i = \bar{y}^i - \sigma_n$
                                or $[m_x + 2\sigma_x - 0.25\sigma_n, m_x + 2\sigma_x + 0.25\sigma_n]$;      $y_r^i = \bar{y}^i + \sigma_n$
                                $\sigma_k^i = 2\sigma_x$

5. Training methods. According to the definitions made in the previous step,
we applied both the one-pass method and the gradient descent method in
order to train the FLSs. In the former, we obtained systems with 500 rules
(because the training set was formed by N = 500 input-output data pairs);
whereas in the latter, we obtained systems with only 16 rules (i.e., 2 fuzzy sets
per antecedent and 4 antecedents per rule, giving a total of $2^4 = 16$ rules).
Note that the number of rules in the gradient descent method did not depend on
the size of the training set; it was fixed in advance at 16 rules. In this
method, the parameters were initialized using the formulas of Table 1.

6. Applying the FLS. Once the systems were trained, we checked their
performance with the testing set. This performance was evaluated by computing
the following RMSE (root mean-squared error):

$$\mathrm{RMSE} = \sqrt{\frac{1}{496} \sum_{k=1504}^{1999} \left[ s(k+1) - f(\mathbf{x}^{(k)}) \right]^2} \qquad (2)$$

As we did 50 simulations for each experiment, we obtained 50 values of
RMSE; so, in order to display an overall performance of each system, we
calculated the average and the standard deviation of these 50 RMSEs. Table
2 shows the performance of the FLSs trained with the one-pass method.
Figure 7 displays the performance of the FLSs trained with the gradient
descent method. Note that, because of space limitations, we only show the
performance obtained with the totally independent approach; but in the
experiments that we did with the other two approaches, we observed that the
error could attain its minimum faster (i.e., in earlier epochs), and that the
gap between the type-1 and type-2 FLSs was enlarged (since these approaches
benefit from parameters tuned in previous systems).

In addition, after being designed, FLSs trained with the gradient descent
method were used for forecasting the Mackey-Glass time series, but now
with noise in the initial conditions (for practical purposes, it is the same as
simply having different initial conditions). The resultant time series was also
corrupted by 0 dB additive noise. Table 3 shows the performance obtained
with those FLSs. Note that, since there was no tuning in this case, we do not
illustrate the error per epoch; we only exhibit the performance in the sixth
epoch (from Fig. 7), as well as the performance obtained with this second
noisy time series. Observe that the behavior of the systems is very similar
in both time series, i.e., the averages of the RMSEs obtained in the first
time series coincide with those obtained here; what is more, the standard
deviations were smaller.

Table 2. Mean and standard deviation of the RMSEs from FLSs trained with the
one-pass method in 50 simulations

FLS                RMSE     $\sigma_{RMSE}$
Singl. T1FLS       0.1954   0.0082
Singl. T2FLS       0.1784   0.0077
Non-singl. T1FLS   0.1371   0.0075
Non-singl. T2FLS   0.1370   0.0075

Table 3. Mean and standard deviation of the RMSEs obtained in 50 simulations
forecasting the original time series and a second time series, both with
additive noise

FLS          RMSE     $\sigma_{RMSE}$   RMSE 2   $\sigma_{RMSE}$ 2
S. T1FLS     0.1566   0.0129            0.1566   0.0112
N-s. T1FLS   0.1553   0.0123            0.1553   0.0107
S. T2FLS     0.1486   0.0107            0.1486   0.0091
N-s. T2FLS   0.1486   0.0106            0.1486   0.0089

Fig. 7. Mean (a) and standard deviation (b) of the RMSEs from FLSs trained with
the gradient descent method in 50 simulations. Parameters were tuned for six
epochs in each simulation
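The data generation of step 1 can be sketched as follows. The Euler update with
step size 1 follows directly from (1); everything else is an assumption made for
illustration: the constant initial history `s_init`, the variance-based reading
of the 0 dB signal-to-noise ratio, the random seed, and the function names.

```python
import math
import random


def mackey_glass(k_max, tau=30, s_init=1.2):
    """Return s(0), ..., s(k_max) from the Euler discretization of (1), step 1.

    The first tau + 1 samples are an assumed constant history equal to s_init.
    """
    s = [s_init] * (tau + 1)
    for k in range(tau, k_max):
        delayed = s[k - tau]
        s.append(s[k] + 0.2 * delayed / (1.0 + delayed ** 10) - 0.1 * s[k])
    return s


def add_uniform_noise(series, snr_db, rng):
    """Add zero-mean uniform noise at the given SNR (variance ratio, assumed)."""
    mean = sum(series) / len(series)
    signal_var = sum((v - mean) ** 2 for v in series) / len(series)
    noise_var = signal_var / (10.0 ** (snr_db / 10.0))
    half_width = math.sqrt(3.0 * noise_var)  # Var of U(-a, a) is a^2 / 3
    return [v + rng.uniform(-half_width, half_width) for v in series]


series = mackey_glass(2000)
clean = series[1001:2001]                                # s(1001), ..., s(2000)
noisy = add_uniform_noise(clean, 0.0, random.Random(0))  # one of the 50 noisy sets
```

Repeating the last line with 50 different seeds would give 50 noisy realizations
of the same noise-free series, in the spirit of the experiments.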
6 Conclusions and Further Work

The methodology proposed here was easily applied for designing different types
of FLS forecasters with distinct methods and different approaches. Furthermore,
it can be directly extended to work not only with time-series forecasters but
also with process modeling.

In the experiments, the gradient descent method generated systems with a
small number of rules, and we obtained very good results, even when the
time-series experiments were tested for robustness. In these latter experiments,
the trained systems were robust enough to handle the noise in the initial
conditions and the additive noise in the testing data sets. On the other hand,
the system with the smallest error was the one designed with the one-pass method;
however, this does not mean that it was the best one, because in an environment
where a quick time response is crucial this system would not be very
satisfactory: it has many rules, which makes it extremely slow even for a single
forecast. In contrast, the gradient descent method typically generates
FLSs with fewer rules; consequently, the time to produce a single output value
is shortened. This feature encourages its use in real-time applications over the
one-pass method.

As further work, the number of rules in systems designed with the one-pass
method might be reduced by using Wang's table look-up scheme [3], adapted to
work with type-2 fuzzy sets. In systems designed with the gradient descent
method, the total number of rules can be reduced (if necessary) by applying
Mendel's SVD-QR method [2]. If the additive noise in the environment is not
stationary, we can use a non-singleton fuzzifier that models the inputs by means
of type-2 fuzzy sets (remember that we only modeled the inputs with type-1 fuzzy
sets, because the additive noise was stationary).

Abstract. This paper presents the utilization of influence diagrams in
the diagnosis of industrial processes. Diagnosis in this context signifies
the early detection of abnormal behavior, and the selection of the
best recommendation for the operator in order to correct the problem
or minimize the effects. A software architecture is presented, based on
the Elvira package, including the connection with industrial control systems.
A simple experiment is presented together with the acquisition and
representation of the knowledge.



1 Introduction 

In recent years, the power generation industry has faced important problems
that require the modernization of current installations, principally in both the
instrumentation and control systems. The current trend consists in increasing
the performance, availability, and reliability of the existing installations.
Performance refers to the number of megawatts that can be generated with a unit
of fuel. Availability refers to the hours during which the plant stops
generating, and reliability refers to the probability of having all the
equipment of the plant available. Additionally, modern power plants are following
two clear tendencies. First, they are very complex processes working close to
their limits. Second, they are highly automated and instrumented, leaving the
operator with very few decisions. However, the classic control systems are
programmed to stop the plant in the presence of abnormal behavior. Some decisions
can be taken at the supervisory level that control systems are unable to make,
i.e., to reason about the abnormal behavior and its probable consequences.

Our research group at the Electrical Research Institute (IIE) has been working
on the design of on-line intelligent diagnosis systems for gas turbines of power
plants [5,3]. This project includes two special challenges. The first is the
management of uncertainty, given the thermodynamic conditions of the gas turbine
and the difficulty of constructing accurate analytical models of the process. The
second is the continuous acquisition of the turbine parameters. The problem here
is the interconnection with different data networks, different proprietary
databases, and probably different field buses, in order to form the correct state
of the turbine. This allows the early detection of small deviations and defines
the recommended actions for maintaining the generation of electricity in optimal
condition.

Diagnosis is a technique used in several fields to find the faults that
explain abnormal behavior in a system. Several approaches have been proposed,
and they can be classified into three kinds [1]:

Data-driven: based on the large amount of data given by modern control and
instrumentation systems, from which meaningful statistics can be computed.

Analytical: based on mathematical models often constructed from physical
first principles.

Knowledge-based: based on causal analysis or expert knowledge; conclusions
and inferences are made given information about the process. They can be found
in several kinds of models and inference methods [7].

The selection of the best approach for a given problem depends on the quality 
and type of available models, and on the quality and quantity of data available. 

This paper presents a knowledge-based diagnosis system that manages the
natural uncertainty found in real applications. The model is composed of
influence diagrams, i.e., a Bayesian network representing the probabilistic
relationships among all the important variables, plus additional nodes that
utilize decision theory for the selection of the best corrective action.

The diagnosis architecture presented in this paper is part of a larger system
formed by a monitor, an optimizer, and a diagnosis module. When the monitor
detects that the process works normally, it runs the optimizer in order to increase
the performance of the process, e.g., the generation of more megawatts per
unit of gas. Conversely, when the diagnosis module detects abnormal
behavior, it identifies the faulty component and generates advice for the operator
in order to return the plant to its normal state. The optimizer and the diagnosis
system are devoted to enhancing the performance and availability indices.

This paper is organized as follows. First, section 2 introduces the influence di- 
agrams and presents a very simple example of their use in industrial applications. 
Section 3 describes the software architecture developed for the construction of 
a prototype of the on-line diagnosis system (DX). Next, section 4 presents an 
application example running coupled with a gas turbine simulator and discusses 
the results obtained. Finally, section 5 concludes the paper and addresses the 
future work in this area. 

2 Influence Diagrams 

An influence diagram is a directed acyclic graph consisting of three types of
nodes [6]:

1. Zero or more chance nodes that represent propositional variables. For example,
turbine-normal, which can be true or false, or temperature-gas, which can
be {high, medium, low}. They are represented by circles in the diagram.

2. Zero or more decision nodes that represent the possible choices available to
the decision maker. They are represented by squares in the diagram.

3. One utility node whose value is the expected utility of the outcome. This
node is represented by a diamond in the diagram.

An influence diagram can be seen as a typical Bayesian network (the chance
nodes) plus the decision and utility nodes. Thus, the arcs into chance nodes
represent the variables on which the nodes are conditionally dependent.
The arcs into decision nodes represent the variables that will be known by
the decision maker at the time the decision is made. The arcs into the utility
node show which variables participate in the calculation of the utility values.
Figure 1 shows a basic influence diagram. Node failure is a Boolean variable
representing the chance of having a failure in the turbine. Node Insp is a
two-value decision: to make or not to make an inspection. Node Ut considers the
values of node failure and node Insp to calculate the utility value of the
possible scenarios. Table 1 describes an example of the four possible scenarios
and their utility values. Every scenario corresponds to a combination of the
values of node Ut's parents, i.e., nodes failure and Insp. $+d$ stands for the
decision of making the inspection and $+f$ stands for the existence of a failure;
$-d$ and $-f$ represent the opposite. The domain experts calculate the utility
values considering the cost of executing the inspection and the possible problems
caused if the inspection is not carried out. For example, the lowest value (3)
corresponds to the presence of the failure combined with the decision not to
inspect the turbine ($Ut(+f, -d)$). On the contrary, the highest value (10)
corresponds to the decision of no inspection when there is no failure
($Ut(-f, -d)$).
Consider the following a priori probabilities of the failure node:
$P(+f) = 0.14$, $P(-f) = 0.86$. The expected utility value for a decision is
calculated as follows:

$$\begin{aligned}
Ut(+d) &= [P(+f) \times u(+d, +f)] + [P(-f) \times u(+d, -f)] \\
       &= (0.14 \times 8) + (0.86 \times 9) = 8.86
\end{aligned}$$

where $u(x)$ is the utility value of the scenario $x$ established in the utility
node, as in Table 1. Similarly,
$Ut(-d) = (0.14 \times 3) + (0.86 \times 10) = 9.02$. Therefore, the maximum
expected utility, given the current knowledge about the state of the turbine, is
$\max(Ut(+d), Ut(-d)) = 9.02$. The best decision is not to interrupt the
operation of the turbine with an inspection. However, if the probability vector
of node failure changes, then the expected utility would change accordingly.

Table 1. Example of values for the utility node.

Ut     +f    -f
+d      8     9
-d      3    10


Influence diagrams represent an appropriate technique to provide the maximum 
expected utility, i.e., the optimal decision, given the current knowledge about 
the state of the process. 
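The expected-utility calculation of this section can be reproduced with a short
sketch. The utility values are those of Table 1, and the probabilities are the
ones that appear in the calculation (0.14 for a failure, 0.86 for no failure).
The dictionary layout and the function names are hypothetical.

```python
# Utility values from Table 1, indexed as UTILITY[decision][failure state].
UTILITY = {
    "+d": {"+f": 8, "-f": 9},
    "-d": {"+f": 3, "-f": 10},
}

# A priori probabilities of the failure node used in the worked example.
PRIOR = {"+f": 0.14, "-f": 0.86}


def expected_utilities(utility, prob):
    """Expected utility of each decision under the given state probabilities."""
    return {
        decision: sum(prob[state] * u for state, u in row.items())
        for decision, row in utility.items()
    }


def best_decision(utility, prob):
    """Return (decision, value) with the maximum expected utility."""
    eu = expected_utilities(utility, prob)
    decision = max(eu, key=eu.get)
    return decision, eu[decision]
```

If the probability of failure rose to 0.86, the same functions would give
expected utilities of 8.14 for inspecting and 3.98 for not inspecting, so the
recommendation would switch to making the inspection.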

The next section describes the software architecture implemented for an
experiment in a gas turbine simulator using this influence diagram.



3 Software Architecture 

A typical knowledge-based, on-line diagnosis system utilizes a general structure
as shown in Fig. 2. First, the acquisition of the model is required. This can be
obtained with an automatic learning algorithm and historical data. The model is
also complemented with expert knowledge about the operation of the gas
turbines. Once the model has been obtained, it is inserted in the on-line
diagnosis system (DX in the figure). The DX reads real-time values of the
variables that participate in the diagnosis. These variables act as the evidence
from which the posterior probabilities will be calculated. Figure 2 also shows
a utility function entering the DX. This is required since the calculation of the
utility values may depend on some parameters that may change continuously.
For example, some utility functions may require the cost of the MW/hour or
the cost of the fuel in dollars or Mexican pesos. This utility function can be
evaluated and updated every time it is needed. Finally, a graphic user interface
(GUI) is needed so that the notification of the optimal decision can be given to
the operator. Through it, commands can also be issued to the system and new
operation parameters can be introduced.



[Figure 2 block labels: utility function, experts, learning, real-time data, GUI]

Fig. 2. General diagram of the diagnosis system architecture.

Fig. 3. Internal architecture of the diagnosis system.



The DX module shown in Fig. 2 is the main program of the prototype. As 
mentioned before, the program utilizes the Elvira package [4]. The Elvira system 
is a Java tool to construct probabilistic decision support systems. Elvira works 
with Bayesian networks and influence diagrams and it can operate with discrete, 
continuous and temporal variables. However, Elvira was designed to work off- 
line with its specific GUI. Thus, one of the contributions of this work is the 
development of the interface class that allows utilizing the elemental classes of 
Elvira for influence diagrams capture and propagation. Figure 3 describes the 
internal architecture of the DX module. The DX classes module represents all 
the classes written at this location. They read the models, read the variables 
information (e.g. valid ranges of continuous variables), and control the exchange 
of information with the GUI. This module also controls the main loop of the 
diagnosis with the following steps: 

1. read the real-time information,

2. generate the evidence,

3. propagate probabilities,

4. update utility values,

5. get the maximum expected utility, i.e., generate a recommendation.
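The five steps of the main loop can be sketched as one cycle of a generic
diagnosis routine. This is only an illustration and not Elvira's API: the
callables `read_values`, `propagate`, and `utilities` are hypothetical stand-ins
for the data client, the probability propagation, and the utility update, and
the equal-width discretization of continuous readings is an assumption.

```python
def discretize(value, low, high, n_states):
    """Map a continuous reading to one of n_states equal-width intervals."""
    if value <= low:
        return 0
    if value >= high:
        return n_states - 1
    return int((value - low) / (high - low) * n_states)


def diagnosis_cycle(read_values, ranges, propagate, utilities):
    """Run one pass of the diagnosis loop and return (decision, expected utility).

    read_values: callable returning {variable name: continuous reading}
    ranges:      {variable name: (low, high, n_states)}
    propagate:   callable mapping evidence to {state: posterior probability}
    utilities:   callable returning {decision: {state: utility value}}
    """
    readings = read_values()                                   # 1. read data
    evidence = {name: discretize(value, *ranges[name])         # 2. evidence
                for name, value in readings.items()}
    posterior = propagate(evidence)                            # 3. propagate
    table = utilities()                                        # 4. utilities
    expected = {decision: sum(posterior[state] * u for state, u in row.items())
                for decision, row in table.items()}
    best = max(expected, key=expected.get)                     # 5. recommend
    return best, expected[best]
```

Passing the utility table through a callable reflects the fact that utility
values may depend on parameters, such as fuel cost, that change between cycles.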

The module Ptalk implements a data client over an OPC server (OPC stands
for OLE for Process Control, where OLE is Object Linking and Embedding). This
class can exchange data with commercial controllers and databases, such as
Siemens controllers or SQL servers.

The module Tdecision is the main contribution of this research project. It
allows all of Elvira's algorithms to be used in an embedded program for on-line
diagnosis. This module prepares the information that Elvira requires to load
evidence, propagate probabilities, and obtain results. The most important methods
of Tdecision are the following:

Constructor: reads and compiles the file with the model.

SetEvidence: writes the values of the known nodes.

SetUtilityValues: writes an update of the utility values.

Propagate: commands the execution of the propagation of probabilities.

GetUtilityValues: calculates the expected utility values according to the
propagation.

GetMaxUtility: calculates the maximum expected utility value and responds
with the optimal decision.

The next section describes a simple example executed in the prototype. 

4 Application Example 

The architecture has been used in diagnosis experiments on a gas turbine
simulator at the laboratory. The simulation executed for this experiment consists
of a load increase from 2 MW to 23 MW. Six analog signals were sampled
every half second, so 2111 records were obtained during the 20 minutes.

The learning module of Elvira was executed with the K2 algorithm [2] on
the data table with seven variables and 2111 records. The K2 algorithm
was chosen because it allows the structure of the network to be suggested through
the ordering of the variables in the table. This ordering was obtained with
expert advice. For example, the demanded power determines the position of the gas
valve, and this parameter in turn determines the power generated. Table 2 lists the
identifiers, their descriptions, and the number of intervals in the discretization.
The number of intervals was chosen according to the value range and the
granularity required in the experiments. The Decision node can take three
values: in-line revision, off-line revision, and do nothing. Figure 4 shows
the resulting influence diagram involving the six variables, obtained by Elvira's
learning modules. The MWdem variable represents the power demanded by the
operator. In an automatic control system, this is the only set point that the
operator manipulates. The rest of the variables move according to this signal.
The position of the gas control valve, Vgas, is the control variable that represents
the aperture of the gas valve. Thus, if more power is demanded, a larger aperture
will be read in this variable. Variable MW is the measurement of the megawatts
generated. If the gas valve is opened, the megawatts will increase. Finally, the
TTXDi variables represent the exhaust gas temperature in the turbine at
different points of the circumference. The Utility node has 24 values corresponding
to all the combinations of 8 values of MWdem and 3 values of Decision.
Summarizing, the power demand causes the aperture of the gas valve, and this



Table 2. Variables participating in the experiments.

ID        Description                 Num. states
MWdem     Demanded power              8
Vgas      Gas valve position          10
MW        Generated power             8
TTXD1     Exhaust gas temperature 1   10
TTXD2     Exhaust gas temperature 2   10
TTXD3     Exhaust gas temperature 3   10
Decision  Decision                    3
Utility   Utility                     24

Fig. 4. Influence diagram obtained with automatic learning and expert advice.

aperture causes the generation of megawatts. The generation also produces an
increment (or decrement) in the temperature of the exhaust gases. Several tests
were made with different discretization values in order to find the best result in
the learning process. Also, different orderings were provided to the K2 algorithm
to obtain different structures, as explained above.

This experiment considers that if the value of the power demand MWdem and
the aperture of the valve Vgas are known, then a diagnosis of the turbine can be
made and the decision of in-line revision, off-line revision, or do nothing
can be obtained. Also, the utility values can be calculated using
the decision and the power demand values. The utility function considers the
cost of taking a decision and the gain obtained if this decision is made in the
current situation. For example, if more power is demanded, the associated cost
of stopping the turbine for an off-line revision is much higher and the benefit
may not be high since full power was demanded. On the contrary, if a fault
has been detected and the decision is to make a major revision, then the cost is
high but the benefit is also very high. In this case, doing nothing can be cheap but
the benefits can be negative, e.g., a major malfunction of the turbine.

Therefore, the utility function defined in these experiments consists of an
equation depending on the parents of the utility node, i.e., the decision made
and the power demand. The equation relates the cost of the decision made
and the gain obtained. More experiments are being carried out to define the best
equation that produces the optimal results according to controlled scenarios.

Table 3. Results obtained in ten executions of the prototype.

                                                        Decision
MWdem   Vgas    MW      TTXD1   TTXD2   TTXD3   Off-line  In-line  Nothing
16,5    38,2    22,7    353,5   331,5   503,8   10.0      11.0     5.0
14,4    41,2    12,3    370,6   224,2   454,7   8.0       9.0      4.0
15,4    59,5    13,4    217,2   156,1   433,7   8.0       9.0      4.0
17,5    77,9    12,3    183,1   401,6   435,7   10.0      11.0     5.0
9,2     47,3    4,1     302,4   525,9   234,3   4.0       5.0      2.0
15,4    65,7    12,3    424,7   107,0   510,9   8.0       9.0      4.0
2,0     48,4    19,6    141,1   392,6   180,1   0.0       1.0      0.0
2,0     47,3    18,5    214,2   268,3   325,5   0.0       1.0      0.0
11,3    60,6    8,2     384,6   453,7   417,7   6.0       7.0      3.0
23,7    69,7    16,5    284,4   376,6   417,7   0.0       15.0     0.0


Table 3 shows the results obtained when the program carries out ten different
execution cycles (chosen at random). The first six columns give the current value
of the corresponding variable and its assigned interval. The value on the left
is the real value read by the system, and the value on the right is the
number of the discretization interval corresponding to that value. For example, in
the first cell, variable MWdem reads 16 MW, which corresponds to interval 5
of the discretization. The last three columns correspond to the expected utility
values for each possible decision. For example, in the first row, the best decision
to recommend is an in-line revision (value of 11), and the worst
decision is to do nothing (value of 5). The remaining rows also select the in-line
revision, but with different levels of intensity. For example, the last row suggests
the in-line revision with 15 to 0, in contrast to 11 to 10 in the first row.
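
The reading of Table 3 described above can be reproduced with a few lines of
Python. The sketch below copies the expected utilities of the first and last
rows from Table 3; the function name and the notion of a "margin" over the
runner-up are illustrative assumptions, not part of the prototype.

```python
# Expected utilities copied from the first and last rows of Table 3.
ROWS = {
    "row 1": {"off-line": 10.0, "in-line": 11.0, "nothing": 5.0},
    "row 10": {"off-line": 0.0, "in-line": 15.0, "nothing": 0.0},
}

def best_decision(utilities):
    """Return (best decision, margin over the second-best decision)."""
    ranked = sorted(utilities.items(), key=lambda kv: kv[1], reverse=True)
    (best, top), (_, second) = ranked[0], ranked[1]
    return best, top - second

for name, utilities in ROWS.items():
    print(name, best_decision(utilities))
```

Both rows recommend the in-line revision, but the margin is 1 unit in the first
row and 15 units in the last, which is the difference in intensity noted in the
text.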

These results of course depend on the knowledge provided by the experts
in two respects: first, the structure of the influence diagram, and second, the
numerical parameters required in the reasoning. These are the a priori and
conditional probabilities of the chance nodes, and the utility values stored in
the utility node. These ten runs of the prototype demonstrated the advantage of
performing an in-line diagnosis together with the recommendation of the optimal
action in the current situation.



5 Conclusions and Future Work 

This paper has presented an on-line diagnosis system for gas turbines. The
system uses influence diagrams, which combine probabilistic reasoning and decision
theory techniques. The probabilistic models are constructed with automatic learning
algorithms guided by expert knowledge. The system is implemented on top of the
Elvira package and other classes designed by this research group. The main
strength of influence diagrams is the formal representation and manipulation
of the models: probabilistic and decision/utility. An optimal recommendation is
issued based on the current information about the state of the turbine. One of the
contributions of this work is the design of the Tdecision class. It allows new
evidence to be inserted, utility values to be changed if needed, and the propagation
routines of Elvira to be called. This propagation produces two different results:
first, the maximum expected utility, which defines the optimal action, and second,
the posterior probability of all the chance nodes, which may describe the behavior
of the turbine.

Experiments on a gas turbine simulator have shown the feasibility of using
this technique and this prototype in more complex, real applications.
The next task will be to acquire real data from a real turbine in a power plant,
to run the automatic learning algorithms to obtain more accurate models,
and to ask the experts which information is required to make a
decision and which information is required to calculate the utility value of the
decisions. These two steps will provide a better influence diagram for a specific
turbine.

Future work is going in this direction.

Abstract. This paper proposes a new approach for online fault diagnosis
in dynamic systems, combining a Particle Filtering (PF) algorithm
with a classic Fault Detection and Isolation (FDI) framework. Of the 
two methods, FDI provides deeper insight into a process; however, it 
cannot normally be computed online. Our approach uses a preliminary 
PF step to reduce the potential solution space, resulting in an online al- 
gorithm with the advantages of both methods. The PF step computes a 
posterior probability density to diagnose the most probable fault. If the 
desired confidence is not obtained, the classic FDI framework is invoked. 
The FDI framework uses recursive parametric estimation for the residual 
generation block and hypothesis testing and Statistical Process Control 
(SPC) criteria for the decision making block. We tested the individual 
methods with an industrial dryer. 



1 Introduction 

Fault diagnosis is increasingly being discussed in literature [1] and is the subject 
of international workshops and special journal issues [13,14]. Industrial appli- 
cations require adequate supervision to maintain their required performance. 
Performance can decrease due to faults, which generate malfunctions. Malfunc- 
tions of plant equipment and instrumentation increase the operating costs of any 
plant and can sometimes lead to more serious consequences, such as an explosion. 

Supervision of industrial processes has become very important. Usually this 
supervision is carried out by human operators. The effectiveness of human su- 
pervision varies with the skills of the operator. 

In this paper, we consider processes that have a number of discrete modes or
operating regions corresponding to different combinations of faults or regions of
qualitatively different dynamics. The dynamics can be different for each discrete
mode. Even if there are very few faults, exact diagnosis can be quite difficult.
However, there is a need to monitor these systems in real time to determine
what faults could have occurred. We wanted to investigate whether we could do
real-time fault diagnosis by combining a principled, probabilistic PF approach
with classic FDI techniques.

Particle Filtering [4] is a sequential Monte Carlo algorithm
that approximates the belief state using a set of samples, called particles, and
updates the distribution as new observations are made over time. PF and some
variants have been used for fault diagnosis with excellent results [2,16,11,12].

Our FDI framework uses a recursive Parametric Estimation Algorithm (PEA) 
for the residual generation block and hypothesis testing and Statistical Process 
Control (SPC) criteria for the decision making block. 

PEA is used for online identification in time-varying processes [10],
identifying the current discrete mode of the process [19,20]. A supervision block
ensures robustness and correct numerical performance, specifically, initialization
by means of the adaptation algorithm, parameter estimate bounds checking,
input-output persistent detection, oscillation detection, disturbances and
set-point changes, signal saturation, and bumpless transfer (mode transition).

The paper is organized as follows: Section 2 describes the industrial process, 
mathematical models and experimental tests. Section 3 presents the PF algo- 
rithm as a fault diagnosis approach and discusses its performance. Section 4 
describes the residual generation and decision making blocks and their incorpo- 
ration into the FDI framework. Section 5 discusses the strengths and weaknesses 
of both methods and describes the combined approach. Lastly, we suggest future 
directions. 



2 Processes Monitored 

To test the fault diagnosis algorithms, we worked with a common, real-world 
process. An industrial dryer was adapted for experimental purposes, allowing 
faulty points to be repeatedly implemented. The industrial dryer is a thermal 
process that converts electricity to heat. The dryer is able to control the exit air 
temperature by changing its shooting angle. The generated faults were imple- 
mented using the fan speed (low/high), fan grill (closed/open), and dryer exit 
vent (clear/obstructed). Using only measurements taken with a temperature sen- 
sor, we wanted to diagnose the most probable faulty point online. Fig. 1 shows 
an actual photo and a schematic diagram of a dryer with potential faulty points. 

Fig. 1. Industrial dryer. This dryer uses a motor-driven fan and a heating coil to
transform electrical energy into convective heat. The diagram shows the sensor and
potential faulty points.

Mathematical model. Normal operation corresponds to low fan speed, an
open airflow grill, and a clean temperature sensor. We denote this discrete mode
$z_t = 1$. We induced different types of faults: $z_t = 2$ (faulty fan), $z_t = 3$
(faulty grill), $z_t = 4$ (faulty fan and grill), etc. We applied an open-loop step
test for each discrete mode [18]. Using the monitored data $(y_t, u_t)$ and a PEA [15],
an Auto-Regressive with eXogenous variable model, ARX$(n_a, n_b, d)$, was proposed
for each discrete mode $z_t$,

$$y_t = a_1(z_t)\,y_{t-1} + \cdots + a_{n_a}(z_t)\,y_{t-n_a} + b_1(z_t)\,u_{t-1-d} + \cdots + b_{n_b}(z_t)\,u_{t-n_b-d} \qquad (1)$$

where $n_a$ and $n_b$ are the numbers of coefficients $a(\cdot)$ and $b(\cdot)$, and $d$
represents the discrete dead time. From the ARX model we can get the deterministic
state space representation using a standard control engineering procedure:



$$x_t = A(z_t)\,x_{t-1} + B(z_t)\,\gamma_t + F(z_t)\,u_t \qquad (2)$$

$$y_t = C(z_t)\,x_t + D(z_t)\,v_t \qquad (3)$$

$y_t$ denotes the measurements, $x_t$ denotes the unknown continuous states, and $u_t$
is a known control signal. $z_t \in \{1, \ldots, n_z\}$ denotes the discrete modes (normal
operation, faulty fan, faulty grill, etc.)^1 at time $t$. The process and measurement
noises are i.i.d. Gaussian: $\gamma_t \sim \mathcal{N}(0, I)$ and $v_t \sim \mathcal{N}(0, I)$.
The parameters $A(\cdot)$, $B(\cdot)$, $C(\cdot)$, $D(\cdot)$, and $F(\cdot)$ are matrices
with $D(\cdot)D(\cdot)^T > 0$. $z_t \sim p(z_t \mid z_{t-1})$ is a Markov process. This model
is known as a jump Markov linear Gaussian (JMLG)
model [2], and combines the Hidden Markov and State Space models. The noise
matrices $B(\cdot)$ and $D(\cdot)$ are learned using the Expectation-Maximization (EM)
algorithm [7]. The left plot of Fig. 2 shows a graphical representation of the
JMLG model.
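
To make the structure of the JMLG model concrete, the following Python sketch
simulates a scalar version of the state and measurement equations: a discrete mode
is drawn from a Markov chain, and the continuous state and measurement are then
generated with that mode's parameters. The scalar restriction, the function name,
and the data layout of `params` and `trans` are assumptions for illustration; the
dryer models in this paper are identified from data and are not reproduced here.

```python
import random

def simulate_jmlg(params, trans, u, z0=1, x0=0.0, seed=0):
    """Simulate a scalar jump Markov linear Gaussian model.

    params[z] = (A, B, C, D, F) for discrete mode z (assumed layout).
    trans[z]  = {next_mode: probability}, i.e. p(z_t | z_{t-1} = z).
    u         = sequence of known control inputs u_t.
    """
    rng = random.Random(seed)
    z, x = z0, x0
    modes, ys = [], []
    for u_t in u:
        nxt, probs = zip(*trans[z].items())
        z = rng.choices(nxt, weights=probs)[0]         # z_t ~ p(z_t | z_{t-1})
        A, B, C, D, F = params[z]
        x = A * x + B * rng.gauss(0.0, 1.0) + F * u_t  # state equation
        y = C * x + D * rng.gauss(0.0, 1.0)            # measurement equation
        modes.append(z)
        ys.append(y)
    return modes, ys

# Assumed two-mode example: the modes differ only in the input gain F.
params = {1: (0.5, 0.1, 1.0, 0.2, 1.0), 2: (0.5, 0.1, 1.0, 0.2, 3.0)}
trans = {1: {1: 0.95, 2: 0.05}, 2: {1: 0.05, 2: 0.95}}
modes, ys = simulate_jmlg(params, trans, u=[1.0] * 50)
print(modes[:10], [round(y, 2) for y in ys[:10]])
```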

Experimental tests. We physically inserted a sequence of faults, according 
to a Markov process, and made appropriate measurements. The right plot of Fig. 
2 compares the real data with that generated by the JMLG model. The upper 
graph shows the discrete mode of operation over time, and the lower graph makes 
the comparison. The JMLG model successfully represents the dynamics of the 
system. 

The aim of the analysis is to compute the marginal posterior distribution
of the discrete modes $p(z_{0:t} \mid y_{1:t}, u_{1:t})$. This distribution can be derived from
the posterior distribution $p(x_{0:t}, z_{0:t} \mid y_{1:t}, u_{1:t})$ by standard marginalization.
The posterior density satisfies the following recursion^2:

$$p(x_{0:t}, z_{0:t} \mid y_{1:t}) = p(x_{0:t-1}, z_{0:t-1} \mid y_{1:t-1})\, \frac{p(y_t \mid x_t, z_t)\; p(x_t, z_t \mid x_{t-1}, z_{t-1})}{p(y_t \mid y_{1:t-1})} \qquad (4)$$

^1 There is a single linear Gaussian model for each realization of $z_t$.

Fig. 2. JMLG model. The left plot is a graphical representation of the model. The
right plot demonstrates its ability to model the system.




Equation (4) involves intractable integrals; therefore numerical approximations 
such as Particle Filtering (PF) are required. 



3 Particle Filtering Algorithm 

In the PF setting, we use a weighted set of particles
$\{(x_{0:t}^{(i)}, z_{0:t}^{(i)}), w_t^{(i)}\}_{i=1}^{N}$ to
approximate the posterior with the following point-mass distribution:

$$\hat{p}_N(x_{0:t}, z_{0:t} \mid y_{1:t}) = \sum_{i=1}^{N} w_t^{(i)}\, \delta_{x_{0:t}^{(i)}, z_{0:t}^{(i)}}(x_{0:t}, z_{0:t}), \qquad (5)$$

where $\delta_{x_{0:t}^{(i)}, z_{0:t}^{(i)}}(x_{0:t}, z_{0:t})$ is the well-known Dirac-delta function.
At time $t-1$, $N$ particles $\{x_{0:t-1}^{(i)}, z_{0:t-1}^{(i)}\}_{i=1}^{N}$ are given,
approximately distributed according to $p(x_{0:t-1}^{(i)}, z_{0:t-1}^{(i)} \mid y_{1:t-1})$.
PF allows us to compute $N$ particles $\{x_{0:t}^{(i)}, z_{0:t}^{(i)}\}_{i=1}^{N}$
approximately distributed according to $p(x_{0:t}^{(i)}, z_{0:t}^{(i)} \mid y_{1:t})$ at time $t$.
Since we cannot sample from the posterior directly, the PF update is accomplished by
introducing an appropriate importance proposal distribution $q(x_{0:t}, z_{0:t})$ from which
we can obtain samples. Fig. 3 shows a graphical representation of this algorithm^3, which
consists of two steps: Sequential Importance Sampling (SIS) and selection. This
algorithm uses the transition priors as proposal distributions,
$q(x_{0:t}, z_{0:t} \mid y_{1:t}) = p(x_t \mid x_{t-1}, z_t)\, p(z_t \mid z_{t-1})$, so the importance
weights $w_t$ simplify to the likelihood function $p(y_t \mid x_t, z_t)$.

^2 For clarity, we omit the control signal $u_t$ from the argument lists of the various
probability distributions.

^3 For simplicity, $x_t$ was omitted.

The earliest PF implementations were based only on SIS, which degenerates
with time. [9] proposed a selection step, which led to successful implementations.
The selection step eliminates samples with low importance weights and multiplies
samples with high importance weights.
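
The two steps can be sketched in Python for a scalar model with two discrete
modes. This is a minimal illustration under stated assumptions, not the
implementation used in the experiments: the model parameters, the multinomial
resampling used for the selection step, and all function names are hypothetical.

```python
import math
import random

def gauss_pdf(y, mean, std):
    """Density of N(mean, std^2) evaluated at y."""
    return math.exp(-0.5 * ((y - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))

def pf_step(particles, y_t, u_t, params, trans, rng):
    """One SIS + selection step; particles is a list of (z, x) pairs."""
    proposed, weights = [], []
    for z_prev, x_prev in particles:
        modes, probs = zip(*trans[z_prev].items())
        z = rng.choices(modes, weights=probs)[0]            # z_t ~ p(z_t | z_{t-1})
        A, B, C, D, F = params[z]
        x = A * x_prev + B * rng.gauss(0.0, 1.0) + F * u_t  # x_t ~ p(x_t | x_{t-1}, z_t)
        proposed.append((z, x))
        weights.append(gauss_pdf(y_t, C * x, abs(D)))       # weight = likelihood
    if sum(weights) == 0.0:          # every likelihood underflowed: keep all particles
        weights = [1.0] * len(proposed)
    # Selection: multinomial resampling proportional to the weights.
    return rng.choices(proposed, weights=weights, k=len(proposed))

def mode_posterior(particles):
    """Fraction of particles in each discrete mode, approximating p(z_t | y_{1:t})."""
    counts = {}
    for z, _ in particles:
        counts[z] = counts.get(z, 0) + 1
    return {z: c / len(particles) for z, c in counts.items()}

def run_demo(n_particles=500, steps=20, seed=1):
    rng = random.Random(seed)
    # Assumed parameters (A, B, C, D, F); mode 2 has a larger input gain.
    params = {1: (0.5, 0.1, 1.0, 0.5, 1.0), 2: (0.5, 0.1, 1.0, 0.5, 3.0)}
    trans = {1: {1: 0.9, 2: 0.1}, 2: {1: 0.1, 2: 0.9}}
    particles = [(1, 0.0)] * n_particles
    x_true = 0.0
    for _ in range(steps):
        x_true = 0.5 * x_true + 3.0   # noise-free data generated by mode 2, u_t = 1
        particles = pf_step(particles, x_true, 1.0, params, trans, rng)
    return mode_posterior(particles)

print(run_demo())
```

Because the observations in the demo are generated by mode 2, almost all
particles end up in that mode after the selection steps.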





Fig. 3. PF algorithm. An approximation of $p(z_{t-1} \mid y_{1:t-2})$ is obtained through the
unweighted measure $\{z_{t-1}^{(i)}, N^{-1}\}_{i=1}^{N}$ at time $t-1$. For each particle the
importance weights are computed at time $t-1$, which generates an approximation
$\{z_{t-1}^{(i)}, w_{t-1}^{(i)}\}_{i=1}^{N}$ of $p(z_{t-1} \mid y_{1:t-1})$. The selection step is
applied and an approximation of $p(z_{t-1} \mid y_{1:t-1})$ is obtained using unweighted
particles $\{z_{t-1}^{(i)}, N^{-1}\}_{i=1}^{N}$. Note that this approximated distribution and the
previous one are the same. The SIS step yields $\{z_t^{(i)}, N^{-1}\}_{i=1}^{N}$, which is
an approximation of $p(z_t \mid y_{1:t-1})$ at time $t$.



Fault diagnosis results. Given the real observations over time, we tested
the PF algorithm with different numbers of particles $N$ (left graph in Fig. 4). We
define diagnosis error as the percentage of time steps during which the discrete
mode was not identified properly. We use Maximum A Posteriori (MAP) to
define the most probable discrete mode over time. There is a baseline error rate
resulting from human error in timing the discrete mode changes (these changes
were manually implemented).

The right graphs in Fig. 4 compare the true discrete modes with the MAP 
estimates generated by the PF algorithm. The overall diagnosis error is shown 
in the left graph. As we can see, PF improves as the number of particles grows. 
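
The MAP estimate and the diagnosis error defined above can be computed as in the
following Python sketch. The function names and the example sequences are
illustrative assumptions; the posterior is taken to be a dictionary from discrete
mode to probability at one time step.

```python
def map_mode(posterior):
    """Most probable discrete mode for one time step."""
    return max(posterior, key=posterior.get)

def diagnosis_error(true_modes, estimated_modes):
    """Percentage of time steps at which the estimated mode is wrong."""
    wrong = sum(t != e for t, e in zip(true_modes, estimated_modes))
    return 100.0 * wrong / len(true_modes)

# Assumed example: one mistake in five time steps.
true_modes = [1, 1, 2, 2, 3]
estimates = [1, 2, 2, 2, 3]
print(map_mode({1: 0.2, 2: 0.7, 3: 0.1}), diagnosis_error(true_modes, estimates))
```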

To test the stability of our PF algorithm, we engineered some variations
in the transition matrix $p(z_t \mid z_{t-1})$. Despite these changes, the diagnosis error
remained the same. This is predicted by Bayes’ theorem. All inference is based
on the posterior distribution, which is updated as new data becomes available.

Fig. 4. Diagnosis error for the industrial dryer. The left graph is a box and whisker plot
of the diagnosis error for different numbers of particles. The right graphs compare the
true mode with the MAP estimate over time for two runs of 800 and 3,200 particles.



4 FDI Approach 

FDI is a set of techniques used to detect and isolate faults, sometimes with
incomplete information. An FDI system has 3 stages [6]: fault detection (noticing
an abnormal condition), fault isolation (locating the fault), and fault
identification (determining the fault’s size). A residual generator block receives the
input/output process signals and generates the residual^4. There are different
methods for residual generation; we use parametric estimation. A decision making
block performs the fault isolation and identification. We use hypothesis tests
and Statistical Process Control (SPC) criteria for the decision making block.

Residual Generation block. The Parametric Estimation Algorithm (PEA) used as
a residual generation block is an analytical redundancy method [17]. First it
is necessary to learn a reference model of the process in a fault-free situation
(discrete mode $z_t = 1$). Afterwards, model parameters are learned recursively.
The behavior of these estimates allows the detection of an abnormal situation
and fault isolation.

Assuming that faults are reflected in physical parameters, the basic idea of 
this detection method is that parameters of the real process are learned recur- 
sively; then the current learned parameters are compared with the parameters in 
a fault-free discrete mode. Any substantial difference indicates a change in the 
process which may be interpreted as a fault. One advantage of the PEA is that 
it gives the size of the deviations, which is important for fault analysis. However, 
the process requires a persistent excitation in order to learn the parameters on- 
line. Additionally, the determination of physical parameters from mathematical 
ones generally is not unique and is only possible for low-order models. 
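
As an illustration of recursive parameter learning, the sketch below applies
recursive least squares (RLS) to a first-order ARX model and compares the estimates
with a fault-free reference. RLS is used here as a common stand-in; the paper does
not state which recursive estimator the PEA uses, and the model order, parameter
values, and names are assumptions. The random binary input provides the persistent
excitation mentioned above.

```python
import random

def rls_update(theta, P, phi, y, lam=1.0):
    """One recursive least-squares step with forgetting factor lam.

    theta: current parameter estimates, P: covariance matrix (list of lists),
    phi: regressor vector, y: new measurement. Returns the updated theta and P
    and the prediction error (residual) for this sample.
    """
    n = len(theta)
    P_phi = [sum(P[i][j] * phi[j] for j in range(n)) for i in range(n)]
    denom = lam + sum(phi[i] * P_phi[i] for i in range(n))
    gain = [v / denom for v in P_phi]
    residual = y - sum(theta[i] * phi[i] for i in range(n))
    theta = [theta[i] + gain[i] * residual for i in range(n)]
    P = [[(P[i][j] - gain[i] * P_phi[j]) / lam for j in range(n)] for i in range(n)]
    return theta, P, residual

def run_demo(steps=200, seed=0):
    rng = random.Random(seed)
    a_true, b_true = 0.8, 0.5       # assumed current process: y_t = a*y_{t-1} + b*u_{t-1}
    theta = [0.0, 0.0]
    P = [[1000.0, 0.0], [0.0, 1000.0]]
    y_prev, u_prev = 0.0, 0.0
    for _ in range(steps):
        y = a_true * y_prev + b_true * u_prev
        theta, P, _ = rls_update(theta, P, [y_prev, u_prev], y)
        y_prev, u_prev = y, rng.choice([-1.0, 1.0])   # persistently exciting input
    return theta

reference = [0.9, 0.5]              # assumed fault-free parameters (mode 1)
theta = run_demo()
print([round(t - r, 3) for t, r in zip(theta, reference)])  # parameter deviations
```

The printed deviations play the role of the residual: a value near zero for the
second parameter and a clearly nonzero value for the first indicate which
parameter has moved away from its fault-free value, and by how much.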

Decision Making block. This block is implemented using hypothesis tests
and Statistical Process Control (SPC) criteria. In hypothesis testing, a null
hypothesis ($H_0$) and an alternative hypothesis ($H_1$) are defined in relation to the
parameters of a distribution. The null hypothesis must be accepted or rejected
based on the results of the test. Two types of error can occur; $\alpha$ and $\beta$ are the
probabilities of type I and type II errors, respectively. The hypothesis tests most
likely to be used in the FDI context are shown in Table 1^5.

^4 Inconsistency between real signals and synthetic signals.



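Table 1. Hypothesis tests for FDI.

Null hypothesis                          Alternative hypothesis            Rejection criteria                                                         Statistical test
$H_0: \mu = \mu_0$, $\sigma^2$ known     $H_1: \mu \neq \mu_0$             $|Z_0| > Z_{\alpha/2}$                                                     $Z_0 = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}$
                                         $H_1: \mu > \mu_0$                $Z_0 > Z_{\alpha}$
                                         $H_1: \mu < \mu_0$                $Z_0 < -Z_{\alpha}$
$H_0: \mu = \mu_0$, $\sigma^2$ unknown   $H_1: \mu \neq \mu_0$             $|t_0| > t_{\alpha/2,\nu}$                                                 $t_0 = \frac{\bar{x} - \mu_0}{S/\sqrt{n}}$
                                         $H_1: \mu > \mu_0$                $t_0 > t_{\alpha,\nu}$
                                         $H_1: \mu < \mu_0$                $t_0 < -t_{\alpha,\nu}$, $\nu = n - 1$
$H_0: \sigma^2 = \sigma_0^2$             $H_1: \sigma^2 \neq \sigma_0^2$   $\chi_0^2 > \chi^2_{\alpha/2,\nu}$ or $\chi_0^2 < \chi^2_{1-\alpha/2,\nu}$   $\chi_0^2 = \frac{(n-1)S^2}{\sigma_0^2}$
                                         $H_1: \sigma^2 > \sigma_0^2$      $\chi_0^2 > \chi^2_{\alpha,\nu}$
                                         $H_1: \sigma^2 < \sigma_0^2$      $\chi_0^2 < \chi^2_{1-\alpha,\nu}$, $\nu = n - 1$
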

According to the SPC criteria [5], an out-of-control situation may be detected
when, out of 7 consecutive points, all 7 fall on the same side of the center line.
Alternatively, we can make the same determination when 10 of 11, 12 of 14, 14
of 17, or 16 of 20 consecutive points are located on the same side of the center
line.
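
These run rules are straightforward to check on the most recent points of a
monitored statistic. The Python sketch below is one possible encoding; the function
name, the list-based interface, and the choice to ignore points lying exactly on
the center line are assumptions.

```python
# (k, n): an alarm is raised when k of the last n points lie on the same side.
RUN_RULES = [(7, 7), (10, 11), (12, 14), (14, 17), (16, 20)]

def out_of_control(points, center):
    """Return True if any run rule fires on the most recent points."""
    for k, n in RUN_RULES:
        window = points[-n:]
        if len(window) < n:
            continue                       # not enough data for this rule yet
        above = sum(p > center for p in window)
        below = sum(p < center for p in window)
        if max(above, below) >= k:
            return True
    return False

print(out_of_control([0.2] * 7, center=0.0))        # 7 of 7 above the center line
print(out_of_control([0.2, -0.2] * 10, center=0.0)) # alternating points, no alarm
```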

Fault Diagnosis approach. The residual generation block is implemented
by an online PEA. ARX parameters $(\{a(\cdot)_i\}_{i=1}^{n_a}, \{b(\cdot)_j\}_{j=1}^{n_b})$ are
associated with each discrete mode $z_t$. Given this, it is possible to define hypothesis
tests for the fault-free model ($z_t = 1$) versus the faulty discrete modes hypothesis
($z_t > 1$). Under this interpretation [5], $H_0$ is the fault-free discrete mode, $H_1$ is
a faulty discrete mode, $\alpha$ is the probability of a false alarm, i.e., detecting a faulty
discrete mode when there is no fault, and $\beta$ is the probability of failing to detect a
faulty discrete mode. When any of the null hypotheses is rejected, a faulty discrete
mode will be declared. It will be necessary to test the hypotheses for every faulty
discrete mode.

This characterization of faults allows fault isolation, because the most likely 
fault will be the one that does not reject (accepts) the greatest number of null 
hypotheses (the greatest number of parameters) corresponding to it. In other 
words, for fault isolation, we implement a voting scheme based on the number 
of non-rejected null hypotheses. Table 1 shows the SPC criteria. 
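
The voting scheme can be sketched as follows: for each candidate faulty mode, count
the parameter tests whose null hypothesis was not rejected, and select the mode
with the most votes. The data layout and function name in this Python sketch are
assumptions, and the test outcomes are made-up inputs rather than results from the
dryer.

```python
def isolate_fault(not_rejected):
    """Pick the mode with the largest number of non-rejected null hypotheses.

    not_rejected[mode] is a list of booleans, one per parameter test, that is
    True when the null hypothesis for that mode's parameter was not rejected.
    """
    votes = {mode: sum(flags) for mode, flags in not_rejected.items()}
    best = max(votes, key=votes.get)
    return best, votes

# Assumed outcomes of four parameter tests for three candidate faulty modes.
outcomes = {
    2: [True, False, True, False],
    3: [True, True, True, False],
    4: [False, False, False, False],
}
print(isolate_fault(outcomes))
```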

Assuming a normal distribution for the parameter behavior, hypothesis tests
can now be performed using the known mean values and variance for the normal
discrete mode $z_t = 1$; see Table 2.

^5 Symbols used in this table are standard.

Table 2. Hypothesis tests for each parameter, where $i = 1, \ldots, n_a$ and $j = 1, \ldots, n_b$.

$H_0$   $\mu_{a_i} = \mu_{a_i}(z_t = 1)$      $\sigma^2_{a_i} = \sigma^2_{a_i}(z_t = 1)$      $\mu_{b_j} = \mu_{b_j}(z_t = 1)$      $\sigma^2_{b_j} = \sigma^2_{b_j}(z_t = 1)$
$H_1$   $\mu_{a_i} \neq \mu_{a_i}(z_t = 1)$   $\sigma^2_{a_i} \neq \sigma^2_{a_i}(z_t = 1)$   $\mu_{b_j} \neq \mu_{b_j}(z_t = 1)$   $\sigma^2_{b_j} \neq \sigma^2_{b_j}(z_t = 1)$




If one hypothesis is rejected, it will be understood that there is at least an
incipient fault. To avoid false alarms, SPC criteria should be applied to these
tests, so that a behavior, rather than a single test, indicates the presence of a fault.
It should be noted that there will be error propagation in fault isolation, because
testing the estimated parameters against each set of values corresponding to
each faulty discrete mode involves $\alpha$ and $\beta$ errors for each test.

5 Combined Approach 

PF algorithms have several advantages in the fault diagnosis context. They per- 
form online diagnosis dealing with several potential discrete modes, thereby giv- 
ing low diagnosis error. Nevertheless, PF has some problems: 

— Diagnosis results have some degree of uncertainty because PF is a numerical 
approximation. This uncertainty depends on many factors. If the posterior 
probability distribution shows more than one mode, there is uncertainty in 
the model selection. 

— The MAP estimate only gives the most probable fault. There is no allowance 
for the second most probable fault, etc. 

— The JMLG model is not updated online. Individual parameter changes can- 
not be detected because the global effect could be masked by the model 
structure. 

The FDI approach also has important advantages, namely that it provides
a deeper insight into the process. If the relationship between model parameters
and physical parameters is unique, the fault isolation is easy, and it provides
direct fault identification. FDI's primary limitations are:

— The diagnosis is mainly offline; there is a lag of ”n” time steps in the com- 
puting of statistics. 

— The procedure for recursively learning parameters is expensive and imprac- 
tical because the process must be persistently excited. 

We propose a combined system, incorporating both the online speed of PF
and the insight available from FDI. The system consists of two sequential stages:

Stage 1: PF. Using PF, a posterior probability density is computed online given
the observations. If the diagnosis confidence is high enough (see the left
example distribution in Fig. 5), a diagnosis can be made (discrete mode 6 in the
example). However, if the density does not permit a confident diagnosis (see the
right example distribution), we proceed to the FDI stage. Note that even in this
case, PF has reduced the size of the potential solution space.

Stage 2: FDI. Considering only the most likely discrete modes from the PF 
stage (modes 5, 6, and 7 in the right example distribution, Fig. 5), a hypothesis 
test is computed with respect to the current learned discrete mode given by the 
PEA block. These hypothesis tests use a detailed model parameter analysis in 
order to find the right model. 

Stage 1 significantly reduces the number of candidate discrete modes, allow- 
ing stage 2 to diagnose online. 
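
The two-stage logic can be sketched in Python as follows. The confidence
threshold, the minimum probability used to select candidate modes, and the
`fdi_stage` callable are hypothetical: the paper does not specify how confidence is
measured, so a simple threshold on the largest posterior probability is assumed.

```python
def diagnose(posterior, fdi_stage, threshold=0.9, min_prob=0.1):
    """Return (mode, stage) where stage names the step that made the diagnosis.

    posterior: {mode: probability} from the PF stage.
    fdi_stage: callable taking the candidate modes and returning one of them
               (a placeholder for the hypothesis tests of the FDI stage).
    """
    best = max(posterior, key=posterior.get)
    if posterior[best] >= threshold:
        return best, "PF"                       # stage 1 is confident enough
    candidates = sorted(z for z, p in posterior.items() if p >= min_prob)
    return fdi_stage(candidates), "FDI"         # stage 2 on the reduced set

def fake_fdi(candidates):
    """Stand-in for the FDI stage; simply returns the last candidate."""
    return candidates[-1]

confident = {5: 0.03, 6: 0.95, 7: 0.02}
ambiguous = {4: 0.05, 5: 0.30, 6: 0.40, 7: 0.25}
print(diagnose(confident, fake_fdi), diagnose(ambiguous, fake_fdi))
```

In the ambiguous case only modes 5, 6, and 7 are passed on, so the second stage
tests three candidates instead of every discrete mode.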



Fig. 5. Robust fault diagnosis sequence. (Output of the PF algorithm over the
discrete modes: a single candidate model, left, and several candidate models, right.)



5.1 Conclusions and Future Work 

The PF and FDI approaches complement each other. The combined system 
generates less uncertainty than PF alone, yet unlike pure FDI it can still be 
computed online. 

Some important future improvements would be a method for detecting new 
faults (discrete modes) and a better method for recursive learning of the JMLG 
model. There have been some advances along these lines [8,3], but more research 
is needed to obtain reliable solutions.

Abstract. Markov decision processes (MDPs) provide a powerful framework
for solving planning problems under uncertainty. However, it is
difficult to apply them to real-world domains due to complexity and
representation problems: (i) the state space grows exponentially with the
number of variables; (ii) a reward function must be specified for each
state-action pair. In this work we tackle both problems and apply MDPs
to a complex real-world domain: combined cycle power plant operation.
To reduce the state space complexity we use a factored representation
based on a two-stage dynamic Bayesian network [13]. The reward function
is represented based on the recommended optimal operation curve
for the power plant. The model has been implemented and tested with
a power plant simulator with promising results.



1 Introduction 

Markov Decision Processes (MDP) [5,12] and Partially Observable Markov
Decision Processes (POMDP) [10] provide a solution to planning problems under
uncertainty, since their strength lies in the computation of an optimal policy
in accessible and stochastic environments. In these approaches, an agent makes
observations, from which it computes a probability distribution over the
states the system may be in. Based on this distribution, it designs the optimal
sequence of actions to reach a specific goal.

However, in this formalism, the state space grows exponentially with the
number of problem variables, and the cost of its inference methods grows with the
number of actions. Thus, in large problems, MDPs become impractical and inefficient.
Recent solutions such as those in [7,8] introduce feature-based
(or factored) representations to avoid enumerating the problem state space,
allowing this formalism to be extended to more complex domains.
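
A short calculation shows how quickly an enumerated representation grows. The
Python sketch below counts the states and the state-action entries of a flat
table; the variable cardinalities and the number of actions are made-up values,
not those of the power plant model.

```python
from math import prod

def table_sizes(cardinalities, n_actions):
    """Return (number of states, number of state-action pairs) for a flat MDP."""
    n_states = prod(cardinalities)
    return n_states, n_states * n_actions

# Assumed example: 10 binary variables, then 20, with 4 actions.
print(table_sizes([2] * 10, 4))   # (1024, 4096)
print(table_sizes([2] * 20, 4))   # (1048576, 4194304)
```

Doubling the number of variables squares the number of states, which is why a
factored representation that avoids enumerating them is attractive.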

In recent years, the power generation industry has faced important problems
that require the modernization of current installations, principally in both the
instrumentation and control systems. The current trend consists of increasing
the performance, availability, and reliability of the existing installations. The
performance refers to the number of megawatts that can be generated with a unit of
fuel. The availability refers to the hours that the plant is out of service, and the
reliability refers to the probability of having all the plant equipment available.
Additionally, modern power plants are following two clear tendencies. First, they
are very complex processes working close to their limits. Second, they are highly
automated and instrumented, leaving the operator with very few decisions.

However, there still exist some unusual maneuvers that require the experience
and ability of the operator. Some examples are load rejection or responses to
failures [1]. A current strategy is the use of intelligent systems to support
the decision process that the operator carries out. The intelligent system may
learn the actions of an experienced operator and may advise and train
new operators in the decision processes that are required on special occasions.
Unexpected effects of the actions, unreliable sensors, and incompleteness of the
knowledge suggest the use of uncertainty management techniques.

This paper first explains the disturbances presented in a combined cycle 
power plant due to a load rejection and other similar situations, and states a 
possible solution by using an intelligent operator assistant. In section 3, the 
heart of the intelligent system is shown describing the main elements of the 
MDP formalism and its most popular inference methods. Section 4 illustrates a 
feature-based representation extending MDPs to deal with real-world problem 
complexity. In section 5, the implementation details of the operator assistant are 
shown. Finally, in section 6, experimental results and test cases are described. 



2 Problem Domain 

A heat recovery steam generator (HRSG) is a piece of process machinery capable of
recovering residual energy from the exhaust gases of a gas turbine and using it for
steam generation (Fig. 1). Its most important components are: steam drum,
water wall, recirculation pump, and afterburners. The final control elements
associated with its operation are: feedwater valve, afterburner fuel valve, main
steam valve, bypass valve, and gas valve.

During normal operation, the conventional three-element feedwater control
system commands the feedwater control valve to regulate the steam drum level.
However, when a partial or total electric load rejection occurs, this
traditional control loop is no longer able to stabilize the drum level. In this
case, the steam-water equilibrium point moves, causing an enthalpy change of both
fluids (steam and water). Consequently, the enthalpy change causes an increase
in the water level because of a strong water displacement to the steam drum.
The control system reacts by closing the feedwater control valve. However, an
increase in feedwater is needed, not a decrease. A similar case arises
when a sudden high steam demand occurs.

Under these circumstances, the participation of a human operator is necessary
to help the control system decide which actions should be taken to overcome the
transient. A practical solution to this problem is the use of an intelligent
operator assistant that recommends to operators the best corrective action to
take on the process. The operator assistant should be able to find an action
policy according to the crisis dimension, take into account that actuators are
not perfect and can produce undesired effects, and consider the performance,
availability and reliability of the actual plant installations under these
situations.

Fig. 1. Feedwater and main steam systems simplified diagram. Ffw refers to
feedwater flow, Fms refers to main steam flow, dp refers to drum pressure, dl
refers to drum level, and 3-ecs refers to the flow controller.



3 Markov Decision Processes 

Markov Decision Processes (MDPs) [12] seem to be a suitable solution to this
kind of problem, as their strength lies in the computation of an optimal
policy in accessible and stochastic environments.

An MDP $M$ is a tuple $M = \langle S, A, \Phi, R \rangle$, where $S$ is a finite
set of states of the system and $A$ is a finite set of actions.
$\Phi : A \times S \to \Pi(S)$ is the state transition function, mapping an
action and a state to a probability distribution over $S$ for the possible
resulting state. The probability of reaching state $s'$ by performing action $a$
in state $s$ is written $\Phi(a, s, s')$. $R : S \times A \to \mathbb{R}$ is the
reward function; $R(s, a)$ is the reward that the system receives if it takes
action $a$ in state $s$.

A policy for an MDP is a mapping $\pi : S \to A$ that selects an action for each
state. Given a policy, we can define its finite-horizon value function
$V_n^\pi : S \to \mathbb{R}$, where $V_n^\pi(s)$ is the expected value of
applying the policy $\pi$ for $n$ steps starting in state $s$. The value
function is defined inductively with $V_1^\pi(s) = R(s, \pi(s))$ and
$V_n^\pi(s) = R(s, \pi(s)) + \sum_{s' \in S} \Phi(\pi(s), s, s')\, V_{n-1}^\pi(s')$.
Over an infinite horizon, a discount factor $\gamma$ is frequently used to
ensure that policies have a bounded expected value. For some $\gamma$ chosen so
that $0 < \gamma < 1$, the value of any reward from the transition after the
next is discounted by a factor of $\gamma$, the one after that by $\gamma^2$,
and so on. Thus, if $V^\pi(s)$ is the discounted expected value in state $s$
following policy $\pi$ forever, we must have
$V^\pi(s) = R(s, \pi(s)) + \gamma \sum_{s' \in S} \Phi(\pi(s), s, s')\, V^\pi(s')$,
which yields a set of linear equations in the values of $V^\pi(\cdot)$.

A solution to an MDP is a policy that maximizes its expected value. For the
discounted infinite-horizon case with any given discount factor $\gamma$ in the
range $(0, 1)$, there is a policy $\pi^*$ that is optimal regardless of the
starting state [9], and whose value function $V^*$ satisfies the following
equation:
$V^*(s) = \max_a \{ R(s, a) + \gamma \sum_{s' \in S} \Phi(a, s, s')\, V^*(s') \}$.

Two popular methods for solving this equation and finding an optimal policy
for an MDP are: (a) value iteration and (b) policy iteration [12].

In policy iteration, the current policy is repeatedly improved by finding some 
action in each state that has a higher value than the action chosen by the current 
policy for the state. The policy is initially chosen at random, and the process 
terminates when no improvement can be found. This process converges to an
optimal policy [12].

In value iteration, optimal policies are produced for successively longer finite
horizons until they converge. It is relatively simple to find an optimal policy
over $n$ steps, $\pi_n^*(\cdot)$, with value function $V_n^*(\cdot)$, using the
recurrence relation

$$\pi_n^*(s) = \arg\max_a \Big\{ R(s, a) + \gamma \sum_{s' \in S} \Phi(a, s, s')\, V_{n-1}^*(s') \Big\},$$

with starting condition $V_0^*(s) = 0\ \forall s \in S$, where $V_n^*$ is derived
from the policy $\pi_n^*$ as described earlier. The algorithm converges to the
optimal policy for the discounted infinite case in a number of steps that is
polynomial in $|S|$, $|A|$, $\log \max_{s,a} |R(s, a)|$ and $1/(1 - \gamma)$.
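
The value iteration recurrence can be illustrated with a minimal Python sketch.
It is not the implementation used in this work: the two-state plant model, the
action names, the transition probabilities, the rewards and the stopping
tolerance below are all hypothetical values chosen only to make the example
runnable.

```python
def q_value(s, a, v, phi, reward, gamma):
    """R(s, a) + gamma * sum over s' of Phi(a, s, s') * V(s')."""
    return reward[(s, a)] + gamma * sum(p * v[s2] for s2, p in phi[(a, s)].items())


def value_iteration(states, actions, phi, reward, gamma, tol=1e-6, max_iter=1000):
    """Iterate the Bellman backup from V_0 = 0 until the values stop changing."""
    v = {s: 0.0 for s in states}  # starting condition V_0(s) = 0
    for _ in range(max_iter):
        new_v = {s: max(q_value(s, a, v, phi, reward, gamma) for a in actions)
                 for s in states}
        delta = max(abs(new_v[s] - v[s]) for s in states)
        v = new_v
        if delta < tol:
            break
    # Greedy policy with respect to the converged value function.
    policy = {s: max(actions, key=lambda a: q_value(s, a, v, phi, reward, gamma))
              for s in states}
    return v, policy


# Hypothetical toy model: the plant is either on or off its operation curve.
states = ["off_curve", "on_curve"]
actions = ["open", "null"]
phi = {  # phi[(action, state)] maps next states to probabilities
    ("open", "off_curve"): {"on_curve": 0.8, "off_curve": 0.2},
    ("null", "off_curve"): {"off_curve": 1.0},
    ("open", "on_curve"): {"on_curve": 0.5, "off_curve": 0.5},
    ("null", "on_curve"): {"on_curve": 1.0},
}
reward = {(s, a): (1.0 if s == "on_curve" else -1.0) for s in states for a in actions}

v, policy = value_iteration(states, actions, phi, reward, gamma=0.9)
print(policy)
```

In this toy model the greedy policy opens the valve when the plant is off the
curve and does nothing once it is on the curve.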

4 Factored MDPs 

The problem with the MDP formalism is that the state space grows exponentially
with the number of domain variables, and the cost of its inference methods
grows with the number of actions. Thus, in large problems, MDPs become
impractical and inefficient. Factored representations avoid enumerating the
problem state space and make planning under uncertainty tractable in more
complex domains.

In a factored MDP, the set of states is described via a set of random variables
$X = \{X_1, \dots, X_n\}$, where each $X_i$ takes on values in some finite
domain $Dom(X_i)$. A state $\mathbf{x}$ defines a value $x_i \in Dom(X_i)$ for
each variable $X_i$. Thus, the set of states
$S = Dom(X_1) \times \dots \times Dom(X_n)$ is exponentially large, making it
impractical to represent the transition model explicitly as matrices.
Fortunately, the framework of dynamic Bayesian networks (DBN) [14,4] gives us
the tools to describe the transition model function concisely. In these
representations, the post-action nodes (at time $t + 1$) contain matrices with
the probabilities of their values given their parents' values under the effects
of an action.

These representations can have two types of arcs: diachronic and synchronic.
Diachronic arcs are those directed from time $t$ variables to time $t + 1$
variables, while synchronic arcs are directed between variables at time $t + 1$.
Figure 2 shows a simple DBN with 5 binary state variables and diachronic arcs
only, to illustrate the concepts presented.

A Markovian transition model defines a probability distribution over the
next state given the current state. Let $X_i$ denote the variable $X_i$ at the
current time and $X_i'$ the variable at the next step. The transition graph of a
DBN is a two-layer directed acyclic graph $G_T$ whose nodes are
$\{X_1, \dots, X_n, X_1', \dots, X_n'\}$. In the graph, the parents of $X_i'$
are denoted as $Parents(X_i')$. Each node $X_i'$ is associated with a
conditional probability distribution (CPD) $P_\Phi(X_i' \mid Parents(X_i'))$.
The transition probability $P_\Phi(\mathbf{x}' \mid \mathbf{x})$ is then defined
to be $\prod_i P_\Phi(x_i' \mid \mathbf{u}_i)$, where $\mathbf{u}_i$ is the
value in $\mathbf{x}$ of the variables in $Parents(X_i')$. Because there are no
synchronic arcs in the graph, the variables are conditionally independent at
time $t + 1$. So, for an instantiation of variables at time $t$, the probability
of each state at time $t + 1$ can be computed simply by multiplying the
probabilities of the relevant variables at time $t + 1$.

Fig. 2. A simple DBN with 5 state variables.
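
As a hedged illustration of this product rule, the sketch below evaluates the
transition probability of a small DBN with diachronic arcs only. The two binary
variables, their parent sets and the CPD entries are hypothetical; the function
name `factored_transition_prob` is ours, not part of any library.

```python
from itertools import product
from math import prod


def factored_transition_prob(x, x_next, parents, cpds):
    """P(x' | x) as the product over variables of P(x_i' | u_i)."""
    factors = []
    for var, value in x_next.items():
        u = tuple(x[p] for p in parents[var])  # parent values at time t
        factors.append(cpds[var][u][value])
    return prod(factors)


# Hypothetical two-variable DBN: X1' depends on X1; X2' depends on X1 and X2.
parents = {"X1": ("X1",), "X2": ("X1", "X2")}
cpds = {
    "X1": {(0,): {0: 0.9, 1: 0.1}, (1,): {0: 0.2, 1: 0.8}},
    "X2": {
        (0, 0): {0: 1.0, 1: 0.0},
        (0, 1): {0: 0.3, 1: 0.7},
        (1, 0): {0: 0.5, 1: 0.5},
        (1, 1): {0: 0.1, 1: 0.9},
    },
}

x = {"X1": 1, "X2": 0}
p = factored_transition_prob(x, {"X1": 1, "X2": 1}, parents, cpds)

# The probabilities of all next states must sum to one.
total = sum(
    factored_transition_prob(x, {"X1": a, "X2": b}, parents, cpds)
    for a, b in product((0, 1), repeat=2)
)
print(p, total)
```

Only the small per-variable tables are stored; the full state-to-state matrix is
never enumerated.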

The transition dynamics of an MDP can be expressed by a separate DBN
model $\Phi_a = \langle G_a, P_a \rangle$ for each action $a$. However, in many
cases, different actions have similar transition dynamics, differing only in
their effect on some small set of variables. In particular, in many cases, a
variable has a default evolution model, which only changes if an action affects
it directly. Koller and Parr [3] use the notion of a default transition model
$\Phi_d = \langle G_d, P_d \rangle$. For each action $a$, they define
$Effects[a] \subseteq X'$ to be the variables in the next state whose local
probability model is different from $\Phi_d$, i.e., those variables $X_i'$ such
that $P_a(X_i' \mid Parents_a(X_i')) \neq P_d(X_i' \mid Parents_d(X_i'))$. Note
that $d$ can be an action in our model, where $Effects[d] = \emptyset$. If we
define 5 actions $a_1, \dots, a_5$ and a default action $d$ in the example of
the DBN, the action $a_i$ changes the CPD of variable $X_i'$ and so
$Effects[a_i] = X_i'$.



5 An FMDP for the Steam Generation System

In this section, the components and the implementation details of the FMDP-based
model for assisting operators during manual operations in the steam generation
system are explained. First, a knowledge module that manages the transition,
reward and observation matrices is required, i.e., the knowledge base. These
matrices can be established once a finite set of actions on each state and a
finite set of observations are defined. Second, a decision process module is
required, where the MDP algorithm is implemented and interfaced with the
knowledge and actor modules. Third, an actor module is the direct interface with
the environment. In this case, the steam generator of a power plant is the
environment and the operator is the actor. The actor provides the goals required
and executes the commands that modify the environment.

Fig. 3. Recommended operation curve for a Heat Recovery Steam Generator.

The set of states in the MDP is directly obtained from the steam generator
operation variables: feedwater flow (Ffw), main steam flow (Fms), drum
pressure (Pd) and power generation (g), as the controlled variables; and the
disturbance (d), as the uncertain exogenous event. Initially we consider the case
of a "load rejection" as the disturbance. These variables are discretized into a
number of intervals, so the state of the plant (a part of it) is represented by
the combination of the state variables.

For optimal operation of the plant, a certain relation between the state
variables must be maintained, specified by a recommended operation curve. For
instance, the recommended operation curve that relates the drum pressure and
the main steam flow is shown in Fig. 3. The reward function for the MDP is
based on the optimal operation curve. The states matching the curve are
assigned a positive reward and the remaining states a negative one. Given this
representation, the objective of the MDP is to obtain the optimal policy for
getting the plant to a state under the optimal operation curve.

The set of actions is composed of the opening and closing operations on the
feedwater valve (fwv) and the main steam valve (msv). It is assumed that fwv and
msv regulate Ffw and Fms, respectively. Thus, Ofwv(+) and Ofwv(-) denote the
actions of opening and closing fwv, and Omsv(+) and Omsv(-) the actions of
opening and closing msv. The null action (0) indicates no change in these
valves. Table 1 shows the whole set of actions. Initially, only five actions
will be used; however, the formalism can be applied to a greater set of actions
and their combinations.

Table 1. Set of control actions.

Action   Process operation
a0       Ofwv(+)
a1       Ofwv(-)
a2       Omsv(+)
a3       Omsv(-)
a4       0

Fig. 4. Two-stage Dynamic Bayesian Network for the actions a0 and a1.

The state variables are discretized as follows: Pd can take 8 values, Fms can
take 6, and Ffw, d, and g can take two values each, so the state dimension is
$8^1 \times 6^1 \times 2^3 = 384$. To reduce the number of states, we use a
factorized representation based on a two-stage dynamic Bayesian network [4].

The transition model in this work is represented through a two-stage Bayesian
network with 5 state variables, as shown in Fig. 4. The effects on the relevant
nodes during the application of an action are denoted by solid lines. Dashed
lines represent those nodes that are not affected by the same action. In this
approach the action effects are denoted by: $Effects[a_0, a_1] = \{ffw', pd'\}$,
$Effects[a_2, a_3] = \{fms', g'\}$, and $Effects[a_4] = \emptyset$ for the null
action.



6 Experimental Results 

In order to test the factored MDP-based operator assistant, a simplified version
of a linear steam generation simulator was built using Java2 [11]. The
probabilistic state transition model was implemented based on Elvira [6], which
was extended to compute Dynamic Bayesian Networks. The tests were made using
a Pentium PC.

We first constructed a transition model extensionally in order to compare its
compilation time with that of a factored model. The first case is a conditional
probability matrix of dimension $|S|^2$, with $384^2$ probability values per
action, where $S$ is the state space. With this model it took 4.22 minutes to
build a probability table for one state and one action only. Thus, the model for
one state and four actions would take 16.87 minutes, and a complete model would
take almost two hours for S (384 possible states). The number of parameters
enumerated is 737,280.

On the other hand, the factored model is composed of two 32x16 tables (512
parameters each) for the actions a0 and a1, and two 96x16 tables (1,536
parameters each) for the actions a2 and a3. The null action model in both cases
(a4) was built very easily, assuming a deterministic transition from one state
to itself (probability = 1.0). The results were as expected: from the factorized
representation the complete transition model was obtained in less than two
minutes. In this case the number of enumerated parameters was
512(2) + 1,536(2) = 4,096. If we express the space saving as the ratio of the
number of non-factored parameters enumerated to the number of factored
parameters enumerated, our example yields a model 180 times simpler. Similarly,
if the time saving is expressed as the ratio of the non-factored execution time
to the factored execution time, we have a model 3,240 times faster.
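
The space-saving figure can be checked with a few lines of Python. This is only
a worked arithmetic sketch based on the table sizes quoted above; it assumes
that the flat model stores one $|S| \times |S|$ matrix for each of the five
actions, which gives 5 x 384^2 = 737,280 parameters and reproduces the factor
of 180.

```python
# Discretization: Pd (8 values), Fms (6), and Ffw, d, g (2 values each).
n_states = 8 * 6 * 2 * 2 * 2
n_actions = 5  # a0..a4, assumed to have one flat matrix each

# Flat (extensional) model: one |S| x |S| table per action.
flat_params = n_actions * n_states ** 2

# Factored model: two 32x16 tables (a0, a1) and two 96x16 tables (a2, a3).
factored_params = 2 * (32 * 16) + 2 * (96 * 16)

space_saving = flat_params / factored_params
print(n_states, flat_params, factored_params, space_saving)
```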

According to this technique, it is not necessary to enumerate all domain
literals, whether evidence variables (time $t$) or variables of interest (time
$t + 1$). This simplification makes the construction of the transition model
easier and more efficient. For example, in the case of actions a0 and a1 the
characteristic state only requires enumerating the variables Ffw, d and Pd,
which yields $2^2 \times 8^1 = 32$ states. In the same way, the states affected
by action a0 or a1 only need $2^1 \times 8^1 = 16$ combinations. We also obtain
computational savings in the reward specification, given that we do not need to
specify a reward value for each state, but only for characteristic states,
composed in this case of Pd and Fms.

The factored MDP converged in 3 iterations (for $\gamma = 0.3$) using value
iteration in less than 1 second.

7 Conclusions 

Markov decision processes (MDPs) provide a powerful framework for solving
planning problems under uncertainty. However, it is difficult to apply them
to real-world domains due to complexity and representation problems. In this
work, we have applied the factored MDP formalism as a framework for solving
these problems in industrial real-world domains. We described the features of
an FMDP-based operator assistant for the treatment of transients in an HRSG.
To reduce the state space complexity we use a factored representation based
on a two-stage dynamic Bayesian network. To represent the reward function
we have taken advantage of the recommended optimal operation curve for the power
plant. The model has been implemented and tested with a power plant simulator
with promising results.

Some authors [7] recommend splitting a big problem into small pieces and solving
them separately with a set of cooperative agents. Those multi-agent approaches
in fact solve several simpler problems but need domain-dependent coordination
strategies that in some cases are practically impossible to specify.

Currently, we are exploring extensions to the FMDP formalism based on
symbolic representations from classical planning [2] to obtain additional
reductions in the problem complexity. The main idea is to avoid using
multi-agent approaches and to preserve standard dynamic programming algorithms
on the inference side.



Abstract. In this paper, a method is proposed for the interpretation
of outdoor natural images. It ultimately constructs a basic 2-D scene model
that can be used in the visual systems on board autonomous vehicles.
It is composed of several processes: color image segmentation, principal
areas detection, classification and verification of the final model. The
regions provided by the segmentation phase are characterized by their color
and texture. These features are compared and classified into predefined
classes using the Support Vector Machines (SVM) algorithm. An Independent
Component Analysis (ICA) is used to reduce redundancy in the database and
improve the recognition stage. Finally, a global scene model is obtained by
merging the small regions belonging to the same class. The extraction of useful
entities for navigation (like roads) from the final model is straightforward.
This system has been intensively tested through experiments on sequences of
color images of countryside scenes.



1 Introduction 

The construction of a complete model of an outdoor natural environment is one
of the most complex tasks in Computer Vision [1]. The complexity lies in several
factors such as the great variety of scenes, the absence of structures that could
help the visual process and the limited control over variations in the current
conditions (such as illumination [2], temperature and sensor motion). Only some
of these factors can be easily overcome or compensated. Moreover, real-time
algorithms are generally required. Image segmentation [3], i.e. the extraction
of homogeneous regions in the image, has been the subject of considerable
research activity over the last three decades. Many algorithms have been
elaborated for gray scale images. However, the problem of segmentation for color
images, which convey much more information about the objects in the scene, has
received less attention, primarily due to the fact that computer systems were
not powerful enough, until recently, to display and manipulate large, full-color
data sets.

* The author thanks the support of the CONACyT. This work has been partially
funded by the French-Mexican Laboratory on Computer Science (LAFMI, Laboratoire

In [4], an approach for image interpretation in natural environments has
been proposed. It consists of several steps: first, a color segmentation
algorithm provides a description of the scene as a set of the most
representative regions. These regions are characterized by several attributes
(color and texture) [5], in order to identify the generic object class to which
they may belong. Murrieta [4] has evaluated this approach to recognize rocks,
trees or grassy terrains from color images. In [6], a pre-classification step
was used in order to select the database according to some global classification
based on the images: this step makes it possible to use the best knowledge
database depending on the season (winter or summer), the weather (sunny or
cloudy) or the kind of environment (countryside or urban).

This paper is an extension of Murrieta's work. In particular, we focus on
two issues dealing with scene modeling. The first issue is the fast and correct
segmentation of natural images. A good segmentation is fundamental to a
successful identification. The second issue is the information redundancy in the
database. We aim to reduce it with an Independent Component Analysis (ICA). The
ICA [7] of a random vector is the task of searching for a linear transformation
that minimizes the statistical dependence between the output components [8].
The concept of ICA may be seen as an extension of principal component analysis
(PCA). ICA decorrelates higher-order statistics from signals, while PCA imposes
independence up to the second order only and defines orthogonal directions.
Besides, ICA basis vectors are more spatially localized than the PCA basis
vectors, and local features give a better object (signal) representation. This
property is particularly useful for recognition.

The organization of this paper is as follows. Section 2 describes the color
segmentation algorithm. The process used to characterize the regions obtained by
the segmentation step is described in Section 3. In Sections 4 and 5, we discuss
the application of ICA to pattern recognition tasks, in particular with the
SVM classification method. Final scene models and some experimental results
are presented in Section 6. Finally, we give our conclusions and hints for
future work.



2 Color Image Segmentation 

Image segmentation is an extremely important and difficult low-level task. All
subsequent interpretation tasks, including object detection, feature extraction,
object recognition and classification, rely heavily on the quality of the
segmentation process. Image segmentation is the process of extracting the
principal connected regions from the image. These regions must satisfy a
uniformity criterion derived from their spectral components.

The segmentation process could be improved by some additional knowledge
about the objects in the scene such as geometrical, textural, contextual or
optical properties [3,9]. It is essentially a pixel-based process. There are
basically two different approaches to the segmentation problem: region-based and
edge-based methods. Region-based methods take the basic approach of dividing the
image into regions and classifying pixels as inside, outside, or on the boundary
of a structure based on their location and the surrounding 2D regions. On the
other hand, the edge-based approach classifies pixels using a numerical test for
a property such as image gradient or curvature. Each method has its pros and
cons: region-based methods are more robust, but at the expense of poorer edge
localization, while edge-based methods achieve good localization but are
sensitive to noise. Our segmentation method belongs to the former category. The
color space selection is then an important factor to take into account in the
implementation, since it defines the useful color properties [2]. Two goals are
generally pursued: first, the selection of uncorrelated color features and,
second, invariance of the selected features to illumination changes. To this
end, color segmentation results have been obtained and compared using several
color representations. In our experiments, the best color segmentation was
obtained using the I1I2I3 representation (Ohta's space) [2], defined as:

$$I_1 = \frac{R + G + B}{3}, \qquad I_2 = \frac{R - B}{2}, \qquad I_3 = \frac{2G - R - B}{4}. \qquad (1)$$

The components of this space are uncorrelated, so statistically it is the best
way of detecting color variations. They form a linear transformation [9] of
the RGB space, where $I_1$ is the intensity component and $I_2$ and $I_3$ are
roughly orthonormal color components¹. To cope with low-saturation images, we
use a hybrid space. In this case, under-segmentation is expected due to poor
colorimetric features. However, we have maintained good segmentation results by
slightly reinforcing the contours over each color component before segmentation.
To avoid implementing a complex edge detector, an alternative is to use the
$l_1 l_2 l_3$ color space proposed by Gevers [10], formulated as:

$$l_1 = \frac{|R - G|}{|R - G| + |R - B| + |G - B|}, \quad l_2 = \frac{|R - B|}{|R - G| + |R - B| + |G - B|}, \quad l_3 = \frac{|G - B|}{|R - G| + |R - B| + |G - B|}. \qquad (2)$$

These color features behave very well under illumination changes.
Moreover, their advantages are particularly important in edge detection and
pattern recognition.
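
The two color transformations can be sketched per pixel in Python as follows.
This is an illustrative sketch only: the function names are ours, $I_2$ is taken
as $(R - B)/2$ following the usual statement of Ohta's space, and returning
zeros for achromatic pixels (where the denominator of Gevers' features vanishes)
is an assumption, since the text does not say how that case is handled.

```python
def ohta(r, g, b):
    """Ohta's I1I2I3 features for one RGB pixel."""
    i1 = (r + g + b) / 3
    i2 = (r - b) / 2          # assumed form of the second component
    i3 = (2 * g - r - b) / 4
    return i1, i2, i3


def gevers(r, g, b):
    """Normalized absolute color differences (l1, l2, l3) for one RGB pixel."""
    rg, rb, gb = abs(r - g), abs(r - b), abs(g - b)
    total = rg + rb + gb
    if total == 0:
        # Achromatic pixel (R = G = B): features undefined, zeros by assumption.
        return 0.0, 0.0, 0.0
    return rg / total, rb / total, gb / total


bright = gevers(200, 100, 50)
dark = gevers(100, 50, 25)    # same surface color at half the intensity
print(ohta(200, 100, 50), bright, dark)
```

Halving the intensity changes every Ohta component but leaves the normalized
differences unchanged, which illustrates why they are attractive under varying
illumination.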

Our segmentation algorithm is a combination of two techniques: thresholding
(or clustering) and region growing. The advantage of this hybrid method is that
it makes the region growing process independent of the starting point and of the
scanning order of the adjacent cells. This segmentation is carried out in a
bottom-up way: small regions are produced first and then combined to form larger
regions. The method groups the pixels in the 3-D spatial domain of square cells
and gives them the same label. The division of the image into square cells
provides a first arbitrary partition. Several classes are defined by the
analysis of the color histograms. Thus, each square cell in the image is
associated with a class. The fusion of the neighboring square cells belonging to
the same class is done by using an adjacency graph (8-adjacency). Finally, to
avoid over-segmented images, the regions smaller than a given threshold are
merged with the nearest adjacent region using a color distance criterion. We
have adopted the method suggested by Kittler [11] to get the principal
thresholds from the histograms. It assumes that the observations come from
mixtures of Gaussian distributions and uses the Kullback-Leibler criterion from
information theory to estimate the thresholds [11]. In our implementation, this
approach is generalized to get the optimal number of thresholds. The next
section explains the features we use to characterize each of the segmented
regions.

¹ I2 and I3 are somewhat similar to the chrominance signal produced by the
opponent color mechanisms of the human visual system.

Fig. 1. Color image segmentation. a) Original low-saturation color image,
b) segmentation using Ohta's color space, and c) segmentation using a hybrid
color space (Gevers/Ohta), used only for low-saturation color images.

3 Color and Texture Object Features 

Texture is the characteristic used to describe the surface of a given object, and 
it is undoubtedly one of the main features employed in image processing and 
pattern recognition. This feature is essentially a neighborhood property. Haral- 
ick provides a comprehensive survey of most classical, structural and statistical 
approaches to characterize texture [12]. 

In our approach, each region obtained by the previous segmentation algorithm
is represented by a color and texture vector. The texture operators we
use are based on the sum and difference histograms proposed by M. Unser [5],
which are a fast alternative to the usual co-occurrence matrices used for texture
analysis. This method requires less computation time and less memory storage
than the conventional spatial texture methods. A region can then be
characterized by a collection of sum and difference histograms that have been
estimated for different relative displacements $\delta x$ and $\delta y$. For a
given region of the image $I(x, y) \in [0, 255]$, the sum and difference
histograms are defined as:

$$h_s(i) = Card\{ i = I(x, y) + I(x + \delta x, y + \delta y) \}, \qquad h_d(j) = Card\{ j = |I(x, y) - I(x + \delta x, y + \delta y)| \}, \qquad (3)$$

where $i \in [0, 510]$ and $j \in [0, 255]$. Sum and difference images can be
built for every pixel $(x, y)$ of the input image $I$:

$$I_s(x, y) = I(x, y) + I(x + \delta x, y + \delta y), \qquad I_d(x, y) = |I(x, y) - I(x + \delta x, y + \delta y)| . \qquad (4)$$
Furthermore, normalized sum and difference histograms can be computed for
selected regions of the image, so that:

$$H_s(i) = \frac{Card\{ i = I_s(x, y) \}}{m}, \qquad H_d(j) = \frac{Card\{ j = I_d(x, y) \}}{m}, \qquad H_s(i) \in [0, 1] \text{ and } H_d(j) \in [0, 1], \qquad (5)$$

where $m$ is the number of points belonging to the considered region. These
normalized histograms may be interpreted as probabilities. $P_s(i) = H_s(i)$ is
the estimated probability that the sum of the pixels $I(x, y)$ and
$I(x + \delta x, y + \delta y)$ will have the value $i$, and $P_d(j) = H_d(j)$
is the estimated probability that the absolute difference of the pixels
$I(x, y)$ and $I(x + \delta x, y + \delta y)$ will have the value $j$. Several
statistics are applied to these probabilities to generate the texture
features.
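
A minimal Python sketch of these histograms is given below, under stated
assumptions: the image is a plain list of rows, the whole image is treated as
one region, and the histograms are normalized by the number of valid pixel pairs
so that they sum to one. The `contrast` and `homogeneity` statistics follow the
usual sum-and-difference definitions and are shown only as examples of the
statistics mentioned above; they are not taken from the authors' code.

```python
def sum_diff_histograms(image, dx, dy, levels=256):
    """Return normalized sum and difference histograms (Ps, Pd)."""
    height, width = len(image), len(image[0])
    hs = [0] * (2 * levels - 1)  # sums lie in [0, 2 * (levels - 1)]
    hd = [0] * levels            # absolute differences lie in [0, levels - 1]
    m = 0                        # number of valid pixel pairs (assumption)
    for y in range(height):
        for x in range(width):
            x2, y2 = x + dx, y + dy
            if 0 <= x2 < width and 0 <= y2 < height:
                a, b = image[y][x], image[y2][x2]
                hs[a + b] += 1
                hd[abs(a - b)] += 1
                m += 1
    if m == 0:
        raise ValueError("displacement leaves no valid pixel pairs")
    return [c / m for c in hs], [c / m for c in hd]


def contrast(pd):
    """Sum over j of j^2 * Pd(j)."""
    return sum(j * j * p for j, p in enumerate(pd))


def homogeneity(pd):
    """Sum over j of Pd(j) / (1 + j^2)."""
    return sum(p / (1 + j * j) for j, p in enumerate(pd))


# Tiny 4-level test image, horizontal displacement (dx, dy) = (1, 0).
image = [
    [0, 0, 1],
    [0, 0, 1],
    [2, 2, 3],
]
ps, pd = sum_diff_histograms(image, dx=1, dy=0, levels=4)
print(ps, pd, contrast(pd), homogeneity(pd))
```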

We have chosen seven texture operators from these histograms: Energy,
Correlation, Entropy, Contrast, Homogeneity, Cluster shade and Cluster
prominence. This selection has been based on a Principal Component Analysis over
all the texture features proposed by Unser [5]. In this way we obtain a
probabilistic characterization of the spatial organization of an image. Although
the histograms change gradually as a function of the viewpoint, the distance
from the sensor to the scene and the occlusions, these features are rather
reliable. Additionally, the statistical means of $I_1$, $I_2$ and $I_3$ over
each region are used to characterize the color within the region.

4 Classification Using Support Vector Machines 

For our object recognition problem, the classifier input is composed of several
numerical attributes (color and texture) computed for each object (region) in the
image. SVM [13,14] is one of the most efficient classification methods, because
the separation between classes is optimized depending on the kernel function
(linear, polynomial, Gaussian radial basis function, etc.). Like neural networks
(NNs), SVMs are a discriminative supervised machine learning technology, i.e.
they need training with labeled empirical data in order to learn the
classification.

The region attributes we described in section 3 ($\mathbb{R}^{10}$ vector) were
used to build the training sets (classes), considering the selected texture
operators along the eight basic possible directions
$(i, j) \in \{-1, 0, 1\}^2 \setminus \{(0, 0)\}$ for the $\delta x$ and
$\delta y$ relative displacements. The final texture value is obtained as the
mean of those eight values. The idea is to avoid any preferential direction in
the construction of the training set. However, in the classification step, we
have chosen only one direction, $\delta x = \delta y = 1$, in order to speed up
the identification stage.

The database we get is pre-processed and filtered by the Independent Component
Analysis technique. The Support Vector Machines [13] technique has been
evaluated and gave very good results. However, we have found that
it is not very versatile in coping with periodic database updates, due to its
slow learning convergence with our database. Last, note that for our specific
problem, we have used a third-degree polynomial kernel function in
the SVM implementation.

5 Independent Component Analysis 

Recently, blind source separation by ICA has attracted a great deal of attention
because of its potential applications in medical signal processing (EEG and MEG
data), speech recognition systems, telecommunications, financial analysis, image
processing and feature extraction. ICA assumes that signals are generated by
several (independent) sources, whose properties are unknown but merely assumed
to be non-Gaussian. The goal of this statistical method is to transform an
observed multidimensional random vector into components that are statistically
as independent from each other as possible [15]. In other words, "ICA is a way
of finding a linear non-orthogonal co-ordinate system in any multivariate data".
Theoretically speaking, ICA has a number of advantages over PCA and can be
seen as its generalization.

Definition 1. ICA feature extraction of a random data vector
$\mathbf{x} = (x_1, x_2, \dots, x_d)^T$ of size $d$ consists in finding a
transform matrix $W \in \mathbb{R}^{m \times d}$ such that the components of the
transformed vector are as independent as possible. We assume, without loss of
generality, that $\mathbf{x}$ is the linear mixture of $m$ independent sources
$\mathbf{s} = (s_1, s_2, \dots, s_m)^T$,

$$\mathbf{x} = A\mathbf{s} = \sum_{i=1}^{m} \mathbf{a}_i s_i , \qquad (6)$$

where $A \in \mathbb{R}^{d \times m}$ is an unknown matrix called the
mixing/feature matrix and each source $s_i$ has zero mean. The columns of $A$
represent features; we denote them by $\mathbf{a}_i$, and $s_i$ signals the
amplitude of the $i$th feature in the observed data $\mathbf{x}$.

If we choose the independent components $s_i$ to have unit variance, $E\{s_i s_i^T\} = 1$, $i = 1, 2, \ldots, m$, the independent components become unique except for their signs. So the problem of ICA feature extraction can be seen as estimating both $A$ and $s$ through the following linear transform of $x$: 

$$s = Wx . \qquad (7)$$
The statistical model in equation (6) is called the independent component analysis, or ICA, model. Indeed, it is sufficient to estimate $A$, because the estimate of $W$ is the pseudoinverse of $A$. 

Generally, it is very useful to apply some preprocessing to the data in order to make the ICA estimation simpler and better conditioned: 

1. Centering $x$, i.e. subtracting its mean vector: $x \leftarrow x - E\{x\}$. 

2. Whitening $x$, by which we linearly transform $x$ into a “white” vector $\tilde{x}$ whose components are uncorrelated and have unit variance, $E\{\tilde{x}\tilde{x}^T\} = I$. 
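
As a worked illustration of the whitening step, one common construction uses the eigendecomposition of the covariance matrix of the centered data, which is equivalent to an SVD of the data matrix. This is only a sketch under the assumption that the covariance matrix is full rank; the symbols $C$, $Q$ and $\Lambda$ are introduced here for illustration and do not appear in the original text.

```latex
C = E\{x x^T\} = Q \Lambda Q^T, \qquad
\tilde{x} = Q \Lambda^{-1/2} Q^T x, \qquad
E\{\tilde{x}\tilde{x}^T\}
  = Q \Lambda^{-1/2} Q^T \, C \, Q \Lambda^{-1/2} Q^T
  = Q \Lambda^{-1/2} \Lambda \Lambda^{-1/2} Q^T = I .
```

Here $Q$ is the orthogonal matrix of eigenvectors of $C$ and $\Lambda$ the diagonal matrix of its eigenvalues, so the transformed vector has uncorrelated, unit-variance components as required.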

Singular Value Decomposition (SVD) is usually used to obtain the whitened matrices in ICA procedures. Lately, there has been considerable research interest, and many fundamental algorithms [15,7] have been proposed to extract the ICA basis from the data features efficiently. Hyvärinen et al. [8] introduce a fixed-point iteration scheme to find the local extrema of the kurtosis, $\mathrm{kurt}(x) = E\{x^4\} - 3\,(E\{x^2\})^2$, of a linear combination of the observed variables $x$. This algorithm has evolved into the FastICA algorithm, which uses a practical optimization of contrast functions derived from the maximization of negentropy¹. The negentropy of a random vector $x$ with probability density $f(x)$ is defined as follows: 

$$J(x) = \int f(x) \log f(x)\, dx - \int f_{gauss}(x) \log f_{gauss}(x)\, dx , \qquad (8)$$

where $f_{gauss}$ is the density of a Gaussian random variable $x_{gauss}$ with the same covariance matrix as $x$. FastICA can also be derived as an approximative Newton iteration; see [8] for further details. This fixed-point approach was adopted in this paper. 
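
To make the kurtosis-based fixed-point scheme more concrete, the following minimal sketch applies the classical one-unit update $w \leftarrow E\{z\,(w^T z)^3\} - 3w$, followed by normalization, to whitened two-dimensional data. It is an illustration under stated assumptions rather than the implementation used in this work: the function names, the toy data (two binary sources mixed by a rotation, which is already white) and the stopping tolerance are all hypothetical.

```python
import math

def kurtosis(samples):
    """kurt(y) = E{y^4} - 3 (E{y^2})^2, for zero-mean samples y."""
    n = len(samples)
    m2 = sum(y ** 2 for y in samples) / n
    m4 = sum(y ** 4 for y in samples) / n
    return m4 - 3.0 * m2 ** 2

def fixed_point_unit(Z, w, max_iter=50, tol=1e-12):
    """Kurtosis-based fixed-point iteration for one unit on whitened 2-D data.

    Z is a list of whitened 2-D samples, w the initial unit-norm weight vector.
    """
    n = len(Z)
    for _ in range(max_iter):
        proj = [w[0] * z[0] + w[1] * z[1] for z in Z]
        # Update rule: w+ = E{z (w^T z)^3} - 3 w, followed by normalization.
        w_new = [
            sum(z[k] * p ** 3 for z, p in zip(Z, proj)) / n - 3.0 * w[k]
            for k in range(2)
        ]
        norm = math.hypot(w_new[0], w_new[1])
        w_new = [comp / norm for comp in w_new]
        # Converged when old and new directions agree up to sign.
        done = abs(abs(w_new[0] * w[0] + w_new[1] * w[1]) - 1.0) < tol
        w = w_new
        if done:
            break
    return w

# Toy data (hypothetical): two independent binary sources mixed by a rotation,
# so the mixtures are already zero-mean and white.
theta = 0.6
c, s = math.cos(theta), math.sin(theta)
sources = [(a, b) for a in (-1.0, 1.0) for b in (-1.0, 1.0)]
Z = [(c * a - s * b, s * a + c * b) for a, b in sources]

w = fixed_point_unit(Z, [1.0, 0.0])
recovered = [w[0] * z[0] + w[1] * z[1] for z in Z]
```

On this toy example the weight vector aligns, up to sign, with one column of the mixing rotation, so the recovered component is one of the original sources, which is the sign ambiguity mentioned above.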

The SVM has been trained using the ICA representation of all database classes. The ICA-based recognition procedure is fairly similar to the PCA methodology. Initially, the unlabeled vector (color and texture attributes) computed for each region in the segmented image is normalized (centered and whitened). It is then projected into each ICA subspace (an ICA representation must be available for each predefined database class). The projection is simply an inner product between the input vector and each projection basis, 

$$s = W(x - E\{x\}) . \qquad (9)$$

Thus, the projected input vectors can be identified by comparing them with the ICA database basis using the Support Vector Machines classifier. The next section presents some experimental results obtained for natural scene modeling. 

6 Scene Model Generation 

The user must define a priori how many classes are of interest for the application. Obviously, this class selection depends on the environment type. In 

¹ This measure of non-Gaussianity is always non-negative, and it has the additional interesting property of being invariant under invertible linear transformations. 
Fig. 2. Natural scene modeling. a) Outdoor color image, b) region-based segmented image and c) final 2-D scene model (regions labeled Grass and Road). 



our experiments, we have selected 7 classes: Sky, Field, Grass, Tree, Rocks, Water and Wood. These database classes have been carefully filled using the color and texture information obtained from the learning images. Moreover, this supervised database was processed using ICA in order to feed the SVM training procedure. 

Usually, color segmentation methods generate over-segmented images, so a fusion phase is needed. In our methodology, that phase must merge all the neighboring regions with the same texture and color characteristics (same nature). In order to complete the fusion, we need to use the results from the characterization and classification stages. This process gives us an image with only the most representative regions in the scene. 

Table 1 shows the confusion matrix for the most representative label classes in our experiments. These results have been computed using the ICA and SVM recognition procedures. It is important to notice that large regions are more reliably classified than small regions. 

In the evaluation stage, we have analyzed 2868 regions obtained from 82 test images. In these images, 2590 regions were identified as belonging to one of the proposed classes; for example, 462 regions were correctly labeled as “Road”, 43 “Road” regions were incorrectly detected as “Tree” and 17 “Rocks” regions were mislabeled as “Road”. Overall, 2242 regions were correctly identified and 278 regions were left unclassified (outliers) by the SVM technique. Apart from the classes “Wood” and “Water”, for which there are very few samples available, all the classes were classified very accurately. Using a preprocessing stage with ICA gives a relatively better score than PCA. In training and testing the classifier we have 
Table 1. Confusion Matrix for Region Classification Using Color and Texture. 

Classes   Tree   Sky   Grass   Road   Field   Water   Rocks   Success
Tree       613     7      64     58       3       0      32   78.89 %
Sky         12   163       0      0       0       1       0   92.61 %
Grass       69     0     934      8       1       2       0   92.11 %
Road        43     0       0    462       5       0       1   90.41 %
Field        0     0       1      3      23       0       0   85.18 %
Water        2     2       1      0       0       6       0   54.55 %
Rocks       11     5       0     17       0       0      41   55.41 %


only used regions corresponding to the 7 classes. In systems with fewer classes, it would be necessary to include a further class for unknown objects. 
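
The per-class success rates and the overall recognition rate can be re-derived from the confusion matrix of Table 1. The short sketch below does this; the variable and function names are illustrative only, and rows are read as the true class and columns as the assigned class, which is the reading consistent with the counts quoted in the text.

```python
CLASSES = ["Tree", "Sky", "Grass", "Road", "Field", "Water", "Rocks"]

# Rows: true class; columns: assigned class (same order as CLASSES).
CONFUSION = [
    [613, 7, 64, 58, 3, 0, 32],
    [12, 163, 0, 0, 0, 1, 0],
    [69, 0, 934, 8, 1, 2, 0],
    [43, 0, 0, 462, 5, 0, 1],
    [0, 0, 1, 3, 23, 0, 0],
    [2, 2, 1, 0, 0, 6, 0],
    [11, 5, 0, 17, 0, 0, 41],
]

def success_rates(matrix):
    """Percentage of correctly labeled regions for each true class."""
    return [100.0 * row[i] / sum(row) for i, row in enumerate(matrix)]

def overall_rate(matrix):
    """Return (correct, total, percentage) over all classified regions."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct, total, 100.0 * correct / total

rates = dict(zip(CLASSES, success_rates(CONFUSION)))
correct, total, rate = overall_rate(CONFUSION)
```

The diagonal sums to the 2242 correctly identified regions out of the 2590 classified ones, which gives the 86.56% recognition rate reported for ICA preprocessing.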

In our experiments, a standard 400x300 pixel color image was processed by the color segmentation algorithm in about 80 ms on a Sparc Station 5. Some preliminary results are shown in Figure 2, where the ICA/SVM classification was applied. The recognition results using the SVM method are the following: a recognition rate of 80% without data pre-processing, 83.35% with PCA and 86.56% with ICA pre-processing. The total execution time, including all the stages, was less than 1.0 s per image. The test images were taken in spring scenes with different illumination conditions. In applications such as autonomous vehicles and robotics, it is particularly important to have a visual model of the environment available. Thus, the good recognition rate obtained for the road zones may be exploited to complement visual navigation systems. 

7 Conclusions 

We have presented an approach for the detection of navigation zones in outdoor natural images (dirt roads, planar regions, etc.) relying on a robust, modular and fast color image interpretation scheme in the Ohta/Gevers color space. Using color segmentation, we obtain the principal components (regions, objects) of the image. These regions are characterized by their color and texture distribution. Independent Component Analysis allows us to construct and filter a robust database so that redundancy is reduced and the recognition rate is improved. 

We have used the SVM and ICA methods to classify the representative vec- 
tor (IR^°) extracted from each region in the image. A complete 2-D model of the 
outdoor natural images is built with very good results (success rate of 86.56%). We have also observed that using ICA instead of PCA leads to better recognition rates. Using PCA with SVM also gave us good results, because SVMs are relatively insensitive to the representation space. Moreover, the ICA algorithm is more computationally demanding than PCA. 

In future work, we will use powerful shape descriptors and contextual scene analysis in order to add robustness to the recognition of objects in the image. 
Abstract. We present SSD-ARC, a non-parametric registration technique, as an accurate way to calibrate a camera, and compare it with some parametric techniques. In the parametric case we obtain a set of thirteen parameters that model the projective and distortion transformations of the camera; in the non-parametric case we obtain the displacement between pixel correspondences. We found the non-parametric camera calibration to be more accurate than the parametric techniques. Finally, we introduce the parametrization of the pixel correspondences obtained by the SSD-ARC algorithm and present an experimental comparison with some parametric calibration methods. 



1 Introduction 

Most algorithms in 3-D computer vision rely on the pinhole camera model because of its simplicity, whereas video optics, especially wide-angle lenses, generate a lot of non-linear distortion. Camera calibration consists in finding the optimal transformation between the 3-D space of the scene and the camera plane. This general transformation can be separated into two different transformations: the projective transformation [15] and the distortion transformation [2]. 

This paper focuses on an automatic camera calibration method based on a non-parametric registration technique, named SSD-ARC. We compare our approach with the recent parametric techniques given in [17] and [12], obtaining more accurate calibrations with the SSD-ARC approach. 

2 Registration Algorithm 

Let $I_1$ and $I_2$ be a pair of gray-level images. Individual pixels of image $I_t$ are denoted by $I_t(r_i)$, where $r_i = [x_i, y_i]^T$ and $r_i$ ranges over the lattice $L$ of the image. Let $I_1$ be a synthetic calibration pattern and $I_2$ be this pattern viewed by the camera under calibration. We can model the calibration process as finding a set of displacements $V = \{V_1, V_2, \ldots\}$, where $V_i = [u_i, v_i]^T$ is the displacement of the pixel at position $r_i \in I_2$. The goal is to match $I_2$, after the application of the displacement set $V$, with $I_1$: 

$$I_1(x_i, y_i) = I_2(x_i + u_i,\, y_i + v_i) \qquad (1)$$

In the literature there are many methods to solve equation (1); some are based on first derivatives [3,7,4] and others on second derivatives [8,19,20]. They compute the Optical Flow (OF) using a Taylor expansion of (1). In [4], a linear approximation of (1) and a smoothness constraint are used over the magnitude of the displacement gradient $|\nabla V_i|$. 

Others use the Sum of Squared Differences (SSD), a measure of proximity between two images which depends on the displacement set $V$ as: 

$$SSD(V) = \sum_{i \in L} \left[ I_1(x_i, y_i) - I_2(x_i + u_i,\, y_i + v_i) \right]^2 \qquad (2)$$
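
As an illustration of equation (2), the sketch below evaluates the SSD for a displacement field on small images stored as nested lists. It is a simplified example, not the authors' implementation: displacements are assumed to be integers (no interpolation), pixels whose displaced position falls outside the second image are simply skipped, and the function name `ssd` is chosen here for convenience.

```python
def ssd(img1, img2, disp):
    """Sum of squared differences of Eq. (2) for integer displacements.

    img1, img2: 2-D lists indexed as img[y][x]; disp[y][x] = (u, v).
    Pixels whose displaced position falls outside img2 are skipped
    (an assumption made for this sketch).
    """
    height, width = len(img1), len(img1[0])
    total = 0.0
    for y in range(height):
        for x in range(width):
            u, v = disp[y][x]
            xs, ys = x + u, y + v
            if 0 <= xs < width and 0 <= ys < height:
                diff = img1[y][x] - img2[ys][xs]
                total += diff * diff
    return total

# Toy example: img2 is img1 shifted one pixel to the right.
img1 = [[0, 1, 2], [3, 4, 5]]
img2 = [[9, 0, 1], [9, 3, 4]]
zero_disp = [[(0, 0)] * 3 for _ in range(2)]
shift_disp = [[(1, 0)] * 3 for _ in range(2)]
```

With the correct one-pixel displacement the SSD over the overlapping pixels vanishes, whereas the zero displacement leaves a positive residual, which is the quantity a registration algorithm tries to reduce.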

In [16] the authors use the SSD function and an interpolation model based on splines; they compute the OF and use it in the registration task. Minimization of their function is accomplished using the Levenberg-Marquardt algorithm [13]. In [6] the same smoothness constraint as in [4] is added to the SSD function, and additionally the derivatives are normalized with respect to the magnitude of the displacement gradient. 

In this paper we propose to use a variant of the SSD function to solve equation (1) and thereby calibrate the camera. First we briefly review the parametric calibration method presented in [12]; then, in section 4, we describe our approach. In section 5 we compute a parametric representation of the displacement set $V$, and finally we show an experimental comparison between all these methods. 

3 Parametric Camera Calibration 

Let $V_i = T(\theta_p, \theta_d, r_i)$ be a parametric representation of the displacement of the pixel at location $r_i$, where $\theta_p = [\theta_0, \ldots, \theta_7]$ represents a projective transformation and $\theta_d = [\theta_8, \ldots, \theta_{12}]$ represents a distortion transformation. Tamaki et al. [17] compute the complete transformation in two steps. First, they assume a given distortion transformation and then compute the projective transformation that matches $I_2$ to $I_1$ by minimizing an SSD metric. In the second step, they find the inverse distortion transformation by minimizing another SSD metric. Finally, they apply an iterative procedure in order to find the right distortion transformation. In [18] they give a new presentation of this algorithm: they still compute the transformation in two steps, and introduce the implicit function theorem in order to find the right distortion instead of the inverse distortion transformation. They minimize their SSD function using the Gauss-Newton algorithm. Because they make some approximations when computing the first derivatives of the projective model, and they solve the problem in two isolated steps, their method is very sensitive to the initial values. 
In [12] the complete transformation is found in one step: the thirteen parameters of the distortion and projective models are estimated using an SSD metric, which is minimized with the well-known Gauss-Newton-Levenberg-Marquardt algorithm [9]. This method is more robust, simpler and more accurate than Tamaki’s approaches [17,18]. 

4 Non-parametric Camera Calibration 

In [1] Calderon and Marroquin present an approach based on a physical analogy of the SSD equation, which represents the potential energy stored in a set of springs, where the error between the images is represented by the spring deformations. They focus on minimizing the total potential energy, given as the sum of the energy in each of the springs. Because image noise increases the total energy, they introduce a way to disable the contribution of outliers to the global energy, based on the magnitude of the image gradient. 

They propose an energy function based on a coupled system of springs with an Adaptive Rest Condition (ARC) [10]. The SSD-ARC energy function is: 

USSD-ARC{V, 0 = E ^ ^ I VR,1 V M ^ (3) 

z^L 



with 

$$H(V_i) = \left| \nabla I_2(x_i + u_i,\, y_i + v_i) \right|$$

$$E(V_i) = I_1(x_i, y_i) - I_2(x_i + u_i,\, y_i + v_i)$$

$$l_i = \frac{E(V_i)\, H(V_i)}{\mu + H^2(V_i)}$$

where $E(V_i)$ is the error term between images $I_1$ and $I_2$, which depends on the displacement $V_i = [u_i, v_i]^T$, and $H(V_i)$ is the magnitude of the gradient of image $I_2$ for a given displacement. The parameter $l_i$ implements the exclusion of outliers, in other words, of pixels with high errors. The behavior of this energy function depends on the regularization parameters $\mu$ and $\lambda$: $\mu$ controls the exclusion of outliers and $\lambda$ controls the displacement similarity between neighbors. 
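
The three per-pixel quantities defined above can be computed directly, as the following sketch suggests. It rests on assumptions that are not part of the original description: displacements are integers, the gradient of $I_2$ is approximated by central differences with clamped borders, and the function names are hypothetical.

```python
import math

def gradient_magnitude(img, x, y):
    """Central-difference gradient magnitude; borders are clamped (an assumption)."""
    height, width = len(img), len(img[0])
    x0, x1 = max(x - 1, 0), min(x + 1, width - 1)
    y0, y1 = max(y - 1, 0), min(y + 1, height - 1)
    gx = (img[y][x1] - img[y][x0]) / 2.0
    gy = (img[y1][x] - img[y0][x]) / 2.0
    return math.hypot(gx, gy)

def arc_terms(img1, img2, x, y, u, v, mu):
    """Return (E, H, l) for pixel (x, y) displaced by the integer vector (u, v).

    E is the intensity error, H the gradient magnitude of img2 at the displaced
    position, and l = E * H / (mu + H^2), with mu > 0.
    """
    xs, ys = x + u, y + v
    e = img1[y][x] - img2[ys][xs]
    h = gradient_magnitude(img2, xs, ys)
    l = e * h / (mu + h * h)
    return e, h, l

# Toy images: a flat image and a horizontal ramp.
flat = [[0, 0, 0], [0, 0, 0], [0, 0, 0]]
ramp = [[0, 2, 4], [0, 2, 4], [0, 2, 4]]
fives = [[5, 5, 5], [5, 5, 5], [5, 5, 5]]
threes = [[3, 3, 3], [3, 3, 3], [3, 3, 3]]
```

In a flat region the gradient magnitude is zero, so $l_i$ vanishes regardless of the error; on the ramp, with $\mu = 1$, an error of 1 and a gradient magnitude of 2 give $l_i = 2/5$.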

To minimize $U_{SSD-ARC}$ in equation (3) they used Richardson’s iteration (see [5]), and the final solution scheme is: 



yk+l 

Z 




1 E{Vt) 

Xy + 



^E{Vt) 



(4) 









t2 
Fig. 1. Synthetic image pattern (left) and the corresponding camera image (right) 



where $\bar{V}_i$ is the average displacement computed as $\bar{V}_i = V(x_{i+1}, y_i) + V(x_{i-1}, y_i) + V(x_i, y_{i+1}) + V(x_i, y_{i-1})$ (see [1] for details). 

In order to improve the performance of SSD-ARC, they use a scale-space (coarse-to-fine) strategy [11]. 

We propose to use SSD-ARC as a non-parametric calibration technique. In some applications the set of displacements $V$ is enough to solve the calibration problem. However, if only the distortion of the image is to be corrected, without the projective transformation, a further step is needed. In the following section we use the displacement set $V$ to compute a parametric representation of the distortion transformation, which then allows us to undistort the images. 

5 Point Correspondences Method 

To find the thirteen parameters of the projective and distortion transformations, given the set of displacements $V$ computed by SSD-ARC, we use the Point Correspondences (PC) method described in [12]. It uses an SSD metric based on the Euclidean distance and minimizes the resulting error function using the Gauss-Newton-Levenberg-Marquardt method [9]. We name this algorithm SSD-ARC-pc. 

The algorithm is as follows: 

1. Compute the set of displacements $V$ using SSD-ARC. 

2. Estimate $\theta$ using the PC method. Discard $\theta_p$. 

An interesting variant of this method is to take a subset of $V$ instead of the whole set $V$. Since the calibration pattern has a regular structure (rectangles), a natural choice is to select only corner points. We denote this variant of the method as SSD-ARC-pc*. 

6 Experimental Results 

We tested a Fire-i400 FireWire industrial color camera from Unibrain with a 4.00 mm C-mount lens. This camera acquires 30 fps at a resolution of 640x480 pixels. 
Table 1. SSD errors for the parametric and the non-parametric algorithms 

Algorithm      Scales   Time (sec.)   Error
SSD-ARC-pc     1         92.469       15456.726
Tamaki         3        163.844       15263.134
Romero         1         57.000       14820.000
SSD-ARC-pc*    1         79.469       14512.966
SSD-ARC        4         60.469       11847.001


Table 2. Calibration results from the parametric approaches 

Parameter   SSD-ARC-pc    Tamaki        Romero        SSD-ARC-pc*
θ0          1.056         1.0560        1.0431        1.0447
θ1          0.0408        0.0586        0.0433        0.0427
θ2          -22.6761      -20.7956      -19.5562      -18.4801
θ3          -0.0156       -0.0108       -0.0172       -0.0144
θ4          1.0488        1.0696        1.0470        1.0515
θ5          3.7795        0.8944        4.3369        3.4368
θ6          3.4405e-5     -5.3654e-5    2.8627e-5     4.1702e-5
θ7          6.2721e-5     -1.2009e-4    6.7935e-5     6.7851e-5
θ8          -8.2396e-7    4.9466e-7     -7.8158e-7    -8.0626e-7
θ9          1.2537e-12    3.1922e-12    7.1399e-13    9.9604e-13
θ10         311.12427     320.71564     307.9092      313.8008
θ11         203.4334      236.0663      208.59712     208.0083
θ12         1.0092        1.0089        0.9983        0.9916


The calibration pattern shown in Figure 1 was made using the xfig program under Linux, and the image taken by the camera is shown in the same figure. 

In order to have a reference for the performance of our methods, we tested the parametric algorithms described by Tamaki [17] and Romero [12] against SSD-ARC (section 4), SSD-ARC-pc and SSD-ARC-pc* (section 5), using the two images shown in Figure 1. 

The error measure, in all cases, was the SSD function given by equation (2). Table 1 shows the results for these algorithms, including the number of scales used by each algorithm and the computation time (Pentium IV, 1.6 GHz). The final parameters computed by each method are shown in Table 2. Note that Tamaki’s algorithm [17] computes the inverse transformation, a fact that can be observed in the parameter signs in Table 2. Note also that the least error was given by the non-parametric algorithm SSD-ARC. Nevertheless, when we parameterized this solution (SSD-ARC-pc* algorithm) the total error increased significantly, as can be seen in Table 1 in the rows for SSD-ARC and SSD-ARC-pc. Also observe that the SSD-ARC-pc* algorithm gives better results than SSD-ARC-pc, because the displacements of corners are more reliable than the displacements of other pixels. In other words, corners are good feature points to track [14]. 

A visual comparison of all these methods is shown in Figure 2. Each image is formed by the squared differences between the calibration pixels of the pattern 
Fig. 3. Images from our laboratory. Original image taken by the camera (left), and undistorted image computed by SSD-ARC (right) 



image and the corrected camera image. A perfect match would give a black image, and lighter pixels correspond to larger differences. The best match is given by SSD-ARC, followed by SSD-ARC-pc*. The worst match is given by SSD-ARC-pc, especially at the corners of the image. The non-parametric SSD-ARC approach corrects the non-linear distortion introduced by the lens better than the parametric approaches presented in this paper. 

Finally, Figure 3 shows an image of our laboratory taken by the camera and the result of applying the SSD-ARC approach described in this paper. Note the correction, especially at the top of the image and at the left and right sides. 

7 Conclusions 

We have presented two methods to solve the calibration problem: the first is a non-parametric method, named SSD-ARC, and the second is a parametric method, named SSD-ARC-pc. The SSD-ARC method gave the least error in the comparison with other methods. The SSD-ARC-pc* variant merges the parametric and non-parametric calibration methods. We found more accurate results using the corner points of the images instead of all the points, and also a slight error reduction compared with the results from [12]. 

In some applications, for instance in stereo vision systems, the images corrected by SSD-ARC could be the images of rectified cameras (the optical axes are parallel and there are no rotations). The SSD-ARC approach is especially valuable when cheap lenses with high non-linear distortion are used, where the parametric approaches could have poor accuracy. The non-parametric approach could also be used with wide-angle lenses or even with fish-eye lenses. In contrast, parametric approaches need to increase the number of parameters for cameras using large wide-angle or fish-eye lenses. 

We plan to use the SSD-ARC algorithm with cameras using large wide-angle lenses in order to build a stereo vision system for a mobile robot. 
Abstract. In this paper, we propose a method that utilizes the motion vectors (MVs) in an MPEG sequence as the motion depicter for representing video contents. We convert the MVs to a uniform MV set, independent of the frame type and the direction of prediction, and then make use of them as the motion depicter in each frame. To obtain such a uniform MV set, we propose a new motion analysis method using the Bi-directional Prediction-Independent Framework (BPIF). Our approach enables a frame-type independent representation that normalizes temporal features including frame type, MB encoding and MVs. Our approach operates directly on the MPEG bitstream after VLC decoding. Experimental results show that our method achieves good performance, high validity, and low time consumption. 



1 Introduction 

A huge volume of information requires concise feature representations. Among these features, motion provides the easiest access to the temporal features of video, and is hence of key significance in video indexing and video retrieval. This is the reason why motion-based video analysis has received much attention in video database research [1, 2, 3]. 

In the area of motion-based video analysis, many researchers have followed motion analysis methods such as optical flow, popular in computer vision, and block matching, popular in the image coding literature. Cherfaoui and Bertin [4] have suggested a shot motion classification scheme that computes global motion parameters for the camera following an optical flow computation scheme. An example of a block-matching-based motion classification scheme is the work of Zhang et al. [5], who carry out motion classification by computing the direction of motion vectors and the point of convergence or divergence. Noting that using the motion vectors embedded in P and B frames can avoid expensive optical flow or block matching computation, several researchers have started exploring shot motion characterization of MPEG clips [6] [7]. Kobla [8] has proposed the use of flow information to analyze the independent frame-type pattern (i.e., the pattern of I, P, and B frames) in the MPEG stream. However, this approach has the problem that flow estimation considers only single-directional prediction in most frames. 

In this paper, we propose a motion analysis method that normalizes MVs in the MPEG domain using the Bi-directional Prediction-Independent Framework. The normalized MVs (N-MVs) enable accurate video analysis, which is useful in contexts where motion has a rich meaning. Our purpose is to construct an effective and comprehensive motion field by making up for the weak point of the current sparse motion field. Through this process, the N-MVs obtained by BPIF can be utilized as a motion depicter to efficiently characterize the motion features of video, and as feature information for global camera motion estimation in each frame, video surveillance systems, video indexing and video retrieval. 

The remainder of the paper is organized as follows. In the next section, we propose 
the motion analysis method using BPIF in compressed domain and establish the 
active boundary for the proposed one. In Section 3, experimental results will be 
demonstrated to corroborate the proposed method. Finally, we conclude this paper in 
Section 4. 



2 Motion Analysis Using Bi-directional Prediction-Independent 
Framework 

An MB in an MPEG sequence can have zero, one, or two MVs depending on its frame type and its prediction direction [9]. Moreover, these MVs can be forward-predicted or backward-predicted with respect to reference frames which may or may not be adjacent to the frame containing the MB. We therefore require a more uniform set of MVs, independent of the frame type and the direction of prediction. Our approach in this paper represents each MV as a backward-predicted vector with respect to the next frame, independent of the frame type. 



2.1 Motion Flow Estimation 

Let us consider two consecutive reference frames, $R_i$ and $R_j$. Let the B frames between them be denoted by $B_1, \ldots, B_n$, where $n$ is the number of B frames between two reference frames (typically, $n = 2$). From the mutual relation among frames, we can represent each MV as a backward-predicted vector with respect to the next frame, independent of the frame type [8][10]. Roughly, this algorithm consists of the following two steps. 



Motion Flow for $R_i$ and $B_n$ Frames. We derive the flow between the first reference frame $R_i$ and its next frame $B_1$ using the forward-predicted MVs of $B_1$. Then, we can derive the flow using the backward-predicted MVs between the second reference frame $R_j$ and the previous frame $B_n$. Intuitively, if the MB in the $B_1$ frame, $B_1(u, v)$, is displaced by the MV $(x, y)$ with respect to the MB in the $R_i$ frame, $R_i(u, v)$, then it is logical to conjecture that the latter MB is displaced by the MV $(-x, -y)$ with respect to $B_1(u, v)$, where $u$ and $v$ denote the indices of the current MB in the array of MBs. Therefore, the flow for the $R_i$ frame is obtained from the MV with respect to the MB in the $B_1$ frame. The motion flow in the $B_n$ frame is estimated by a method similar to the one described above [8]. 

However, these algorithms have the problem that the number of extracted N-MVs is not enough to represent each frame. The motion flow for the $R_i$ frame uses only forward-predicted vectors in B frames, and the motion flow for the $B_n$ frame uses only backward-predicted vectors in $R_j$ frames. This means that the method considers only single-directional prediction to estimate the motion flow for the $R_i$ and $B_n$ frames. 

Our approach to compensate for this problem is to implement a bi-directional motion analysis algorithm using the forward-predicted motion vector from $R_i$ to $R_j$, free from the limitation of single-directional motion analysis (see Fig. 1). The number of N-MVs in the motion flow for the $R_i$ frame can be improved by using both the backward-predicted MV from the $R_j$ frame to the $B_1$ frame and the forward-predicted MV from the $R_i$ frame to the $R_j$ frame, in contrast with the conventional algorithm, which only uses the forward-predicted MV in the $B_1$ frame to estimate the motion flow in the $R_i$ frame. The derivation of the motion flow for $B_n$ is similar to the estimation method for the $R_i$ frame. 



Fig. 1. Flow estimation. (a) Flow estimation of MVs in $R_i$ frames. (b) Flow estimation of MVs in $B_n$ frames. (Legend: prediction direction; motion flow estimation.) 



The numerical formula is expressed in Eq. (1) and Eq. (2). Let the backward- 
predicted MV in be denoted by B^R. ■ And let the forward-predicted MV from 

to be denoted by r.r. . Then, we can calculate , the motion flow in R- 
frame, as 

cd 

And, if we denote the forward-predicted MV in 5„(,^„,by R.B ^ , then, we can obtain 
the motion flow, r^R ' , from the relationship as follows : 

( 2 ) 

Fig. 1-(a) and Fig. 1-(b) show the motion analysis method using BPIF in the $R_i$ and $B_n$ frames, respectively. The solid line represents the direction of the real MV, and the dotted line shows the direction of the re-estimated MV. We derive a more elaborate motion flow in the $R_i$ and $B_n$ frames by using Eq. (1) and Eq. (2). 



Motion Flow for $B_1 \sim B_{n-1}$ Frames. The motion flow between successive B frames is derived by analyzing corresponding MBs in those B frames and their motion vectors with respect to their reference frames. Since each MB in each B frame can be of one of three types, namely forward-predicted (F), backward-predicted (B), or bidirectional-predicted (D), there exist nine possible combinations. Here, we represent these nine pairs by FF, FB, FD, BF, BB, BD, DF, DB, and DD. Each of these nine combinations is considered individually, and flow is estimated between them by analyzing each of the MVs with respect to the reference frame. The algorithm is divided into the following four cases, according to the combination of corresponding MBs in successive B frames. 

Case 1) the corresponding MBs have forward- + forward-predicted MVs: FF, FD, DF, DD; 

Case 2) the corresponding MBs have backward- + backward-predicted MVs: BB, BD, DB; 

Case 3) the corresponding MBs have forward- + backward-predicted MVs: FB; 

Case 4) the corresponding MBs have backward- + forward-predicted MVs: BF. 
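
The grouping of the nine MB-type pairs into four cases can be expressed as a small decision rule. The sketch below is only an illustration of that grouping; the function names are hypothetical, and the rule (check for a shared forward MV first, then a shared backward MV) is one reading that reproduces the case lists given above.

```python
def has_forward(mb_type):
    """F and D macroblocks carry a forward-predicted MV."""
    return mb_type in ("F", "D")

def has_backward(mb_type):
    """B and D macroblocks carry a backward-predicted MV."""
    return mb_type in ("B", "D")

def flow_case(first, second):
    """Map the MB types of corresponding MBs in successive B frames to Case 1-4."""
    for mb_type in (first, second):
        if mb_type not in ("F", "B", "D"):
            raise ValueError("MB type must be 'F', 'B' or 'D'")
    if has_forward(first) and has_forward(second):
        return 1  # forward + forward: FF, FD, DF, DD
    if has_backward(first) and has_backward(second):
        return 2  # backward + backward: BB, BD, DB
    if first == "F" and second == "B":
        return 3  # forward + backward: FB
    return 4      # backward + forward: BF

# Group all nine combinations by case.
groups = {}
for a in "FBD":
    for b in "FBD":
        groups.setdefault(flow_case(a, b), []).append(a + b)
```

Checking the forward MVs first means that pairs involving a bidirectional MB are assigned to Case 1 whenever both MBs have a forward MV, and to Case 2 otherwise.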

The motion flow estimation method in Case 1 and Case 2 of above four cases is 
obtained by similar method to it in R, or frame. Let the forward-predicted MV in 
be denoted by , and let the backward-predicted MV in be denoted 

by ■ Fig. 2-(a) and Eq. (3) expresses the motion flow in Case 1. The figure and 
the numerical formula in Case 2 are described in Eig. 2-(b) and Eq. (4), respectively. 

~ ^k+i^i — ■*“ ^k^k+i 

Bf,Rj = ■ (4) 

To derive the motion flow in Case 3 and Case 4, we reuse the forward-predicted 
MV from the R frame to the R frame, r.r . In Case 3, the motion flow, B.B, , , is 

derived by B^^^R^ , RiB,^ , and r.r, . This case is shown in Eig. 2-(c) and expressed in 
Eq. (5). 

Rjfj-R;B,^^^,+BZjfj- (5) 

And, if we denote the motion flow between R , , and B. ,, ,hy R B, , , and between 

and by b^R . , we can calculate B^B^^^ , by using R^^ and^^ , which is 

the motion flow in Case 4. This structure is shown in Fig. 2-(d). We can obtain Eq. (6) 
and Eq. (7) by the analysis of the flow and Eq. (8) is deducted from these equations. 

RSi^~^k+~^j (6) 

+ (7) 

RS,-R;^,^BJi.-B;^,. (8) 



2.2 Active Boundary of the Proposed Method 

The flow estimation method in this paper works on the assumption that ‘if 
which is the MB in the non-reference frame Q has the MV (x, y), it is equivalent that 
^f» 2 v 2 ; which is the MB in the reference frame R is moved by the MV (x, y) to 
Fig. 2. Flow estimation of MVs in $B_1 \sim B_{n-1}$ frames. (a)~(d) Flow estimation of MVs from Case 1 to Case 4, successively. (Legend: forward MV; backward MV; backward MV that is to be calculated.) 



However, it should be mentioned that it may always not be appropriate to use the 
MVs in same MB (u,,v,) over Q and R frames. That is, a problem is occurred by the 
following fact; if has the MV (x, y), it is the MV predicted not by R^^ 2 v 2 y hut 
by as Fig. 3-(a). When R^^ 2 v 2 > moves to Q frame by the MV (x, y), the 

corresponding MB is actually 2f„2v2r Therefore, to satisfy the above assumption, the 
error distance, D, should be minimized, as shown in Fig. 3-(b). Namely, most of the
MVs in the compressed domain should be within the error distance D. To verify this
supposition, we calculate the MV histogram and the normal distribution of the MVs from
various types of video scenes. A random variable D has a normal distribution if its
probability density function is

$$p(D) = (2\pi\sigma^2)^{-1/2} \exp\left( -\frac{(D-\mu)^2}{2\sigma^2} \right) \quad \text{for } -\infty < D < \infty.$$



The normal probability density function has two parameters, μ (mean) and
σ (standard deviation). Fig. 4 shows the resulting normal distributions for various
sequences such as news, movies, and music videos. Table 1 gives the numerical
results. The probability is calculated for two bounds on the error distance: the
block size (-8 < D < 8) and the MB size (-16 < D < 16).
If the error distance is within the block size, the distance between and

^(u 2 v 2 ) which is the corresponding MB of will be smaller than the block size. 

Therefore, any neighbor MBs except R^^ 2 ^ 2 ) doesn’t include the area of R,^j^,f as much 
as R_(u2,v2) (Fig. 3-(c)). For this case, our assumption for the corresponding MBs can be
regarded as appropriate. 

Fig. 3. Error estimation. (a) MB mismatching, (b) Error distance, (c) MB matching.

Fig. 4. Normal distribution. (a) Various scenes, (b) Music video scenes.

As shown in Table 1, most compressed videos contain only small motion, below
the block size. Except for fast scenes such as music videos, the probability that the
MVs in a sequence are smaller than the block size is about 99%. In music videos
with medium-speed scenes, this probability is close to 80%. However, as shown in
Table 1, it falls below 70% in music videos with fast scenes. That is, the
proposed method is of limited use in fast scenes, roughly when the standard
deviation exceeds 7. Nevertheless, the proposed method remains usable for
global motion characterization, though not for local motion characterization
within each frame, because the MVs rarely exceed the MB size.
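
As a rough check of the reasoning above, the probability that a normally distributed error distance stays within the block size or the MB size can be computed from the mean and standard deviation alone. The sketch below is only an illustration: the function names are hypothetical, and the values reported in Table 1 may have been measured from the MV histograms rather than from a fitted normal, so the two need not agree exactly.

```python
import math


def normal_cdf(x, mu, sigma):
    """Cumulative distribution function of N(mu, sigma^2)."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def prob_within(limit, mu, sigma):
    """P(-limit < D < limit) for D ~ N(mu, sigma^2)."""
    return normal_cdf(limit, mu, sigma) - normal_cdf(-limit, mu, sigma)


if __name__ == "__main__":
    # (mu, sigma) pairs copied from Table 1; 8 = block size, 16 = MB size.
    for name, mu, sigma in [("News", 0.03, 0.69), ("Movies", 0.05, 2.04)]:
        print(name, prob_within(8, mu, sigma), prob_within(16, mu, sigma))
```

A larger standard deviation spreads the distribution, so the probability of staying within the block size drops, which is the limitation noted for fast scenes.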



Table 1. Normal distribution of MV.

Sequence                 μ       σ       P(-8 < D < 8)   P(-16 < D < 16)
News                     0.03    0.69    0.99            0.99
Movies                   0.05    2.04    0.99            0.99
Music video (fast)      -0.98    9.05    0.68            0.94
Music video (middle)    -0.53    6.99    0.79            0.98
Music video (slow)      -0.38    5.48    0.89            0.99

3 Experimental Results

In this section, we evaluate the effectiveness and performance of the proposed motion
flow method. Our experimental database consists of MPEG-1 coded video
sequences containing various kinds of camera work and object motion.



Comparison of the Number of N-MVs. To evaluate the effectiveness of the
estimation, we provide ground truth and compare it with the results of the flow
estimation steps. Using the original uncompressed image frames, we encode the
frames into MPEG files in which all B frames are replaced by P frames, using an MPEG
encoder. IBBPBB ordering (IPB encoded frames), for example, then becomes IPPPPP
ordering (IPP encoded frames). We apply our flow estimation steps to the files in IPB
format, and we compare the flow vectors of the frames between the two encodings.

The number of N-MVs obtained by our approach is over 18% higher than that obtained in [8], as
listed in Table 2. In detail, for the R frame and the two B frame types in Table 2, the
number of N-MVs obtained by the proposed method has increased by over 8%, 6%, and 44%,
respectively. The increase is most marked for the last B frame type. This result originates in the
fact that most of the MBs in these frames have forward-predicted motion vectors from the
preceding reference frame rather than backward-predicted motion vectors from the following
reference frame. The more forward-predicted motion vectors such a B frame has, the better
the proposed algorithm performs.

$$\text{MV Detected Ratio} = \frac{\text{Number of MV in IPB frame}}{\text{Number of MV in IPP frame}}$$

The MV detected ratio (MDR) is calculated by comparing the number of
N-MVs in IPP encoded frames and in IPB encoded frames. It does not consider
whether the MVs extracted from the two encodings have the same direction in
the corresponding MBs.



The Verification of the Validity for N-MVs. The effectiveness of the proposed
method cannot be verified by the increase in the number of N-MVs alone. For the verification,
we have compared the directions of the corresponding MBs. First, we quantize the
vectors of the two encodings into several principal directions (presently 4 bins), and
compare them. If they have the identical principal direction, we regard the MVs in the IPB
encoded frame as effective MVs. Using these effective MVs, we obtain the
MV effective ratio (MER).

$$\text{MV Effective Ratio} = \frac{\text{Number of Effective MV in IPB frame}}{\text{Number of MV in IPP frame}}$$
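
The two ratios can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: an MV field is represented here as a dictionary from MB index to an (dx, dy) vector holding only MBs that carry an MV, and the four direction bins are taken to be equal 90-degree sectors centred on the axes. Both the data layout and the sector placement are hypothetical.

```python
import math

NUM_BINS = 4  # the text quantizes directions into 4 principal bins


def direction_bin(mv, num_bins=NUM_BINS):
    """Quantize an MV (dx, dy) into one of num_bins equal angular sectors.

    Sector placement (centred on the +x axis) is an assumption.
    """
    dx, dy = mv
    width = 2.0 * math.pi / num_bins
    angle = (math.atan2(dy, dx) + width / 2.0) % (2.0 * math.pi)
    return int(angle // width) % num_bins


def mv_detected_ratio(ipb_field, ipp_field):
    """MDR: number of MVs in the IPB frame over number of MVs in the IPP frame."""
    return len(ipb_field) / len(ipp_field)


def mv_effective_ratio(ipb_field, ipp_field):
    """MER: IPB MVs whose direction bin matches the IPP MV of the same MB."""
    effective = sum(
        1
        for mb, mv in ipb_field.items()
        if mb in ipp_field and direction_bin(mv) == direction_bin(ipp_field[mb])
    )
    return effective / len(ipp_field)


# Toy fields: MB index -> (dx, dy).
ipp = {0: (1, 0), 1: (0, 1), 2: (-1, 0), 3: (0, -1)}
ipb = {0: (2, 0.5), 1: (1, 0.1), 2: (-3, 0)}
```

In this toy example three of the four ground-truth MBs have an estimated MV, but only two of them point in the same quantized direction, so MER is lower than MDR.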

The results of the experiments that verify the validity of the N-MVs are
summarized in Table 3. Examples of flow estimation are shown in Fig. 5. Fig. 5-(b) is
an MB image derived from the re-encoded IPP format files. Fig. 5-(c) and Fig. 5-(d)
are the corresponding MB images from the IPB encoded streams; they
show the MV flows obtained by the proposed method and by Kobla’s method, respectively.
Compared with Fig. 5-(b), which is the ground truth, the proposed method achieves
more accurate flow estimation than Kobla’s
method. The shade of each MB in Fig. 5 represents the direction of the flow vector.
Fig. 6 compares the MV histograms of the IPB sequence and the IPP
sequence. To examine the result more closely, we quantize the MVs into finer
directions. The simulation result shows that our approach is closer to the ground truth
sequence, which is the IPP encoded sequence.



Table 2. Comparison of MDR.

                 Proposed method           Kobla’s method [8]
                 Avg. MV num    MDR        Avg. MV num    MDR
R frame          198            0.71       176            0.63
B frame (1)      216            0.78       199            0.71
B frame (2)      219            0.79       98             0.35
Total frame      212            0.76       163            0.58


Table 3. Comparison of MER.

                 Proposed method    Kobla’s method [8]
R frame          0.88               0.81
B frame (1)      0.75               0.60
B frame (2)      0.82               0.43
Total frame      0.81               0.61




Fig. 5. Examples of flow estimation. (a) Original image, (b) MVs in the IPP encoded frame,
(c) Estimated MVs in the IPB encoded frame (proposed method), (d) Estimated MVs in the
IPB encoded frame (Kobla’s method).

Fig. 6. Histogram comparison in N-MV field (horizontal axis: bin).



The Application of N-MVs. Finally, we show an object tracking method as one
application of the N-MV field. Fig. 7 depicts effective object tracking with the
proposed motion flow and, at the same time, attests the validity of the proposed N-MV
representation by comparison with a tracking method using the MV field of IPP encoded
frames. Object tracking using only the forward-predicted MVs of P frames in an IPP
encoded sequence is shown in Fig. 7-(a). This differs little from Fig. 7-(b), which uses the
motion flow from our BPIF; rather, it has a smoother curve in terms of the tracked
motion. This means that the proposed motion analysis method does not seriously
transform or distort the real MVs in the MPEG domain and is sufficiently reasonable.
The object tracking method used in this paragraph follows [11].




Fig. 7. Comparison of object tracking. (a) Object tracking in IPP encoded frame, (b) Object
tracking in N-MV field.



4 Conclusions 

In this paper, we have proposed a motion analysis method based on
normalizing the MVs in the MPEG compressed domain. We have proposed the Bi-directional
Predicted-Independent Framework for generating a structure in which I,
P, and B frames can be treated equivalently. Simulation results show that our
frame-type-independent framework can represent a dense and comprehensive
motion field. These N-MVs can be used as a motion descriptor for video indexing,
with the advantages of computational efficiency and low storage requirements.

Abstract. We explore the impact of including hue in a feature construction
algorithm for colour target detection. Hue has a long-standing
record as a good attribute in colour segmentation, so it is expected to
strengthen features generated from RGB alone. Moreover, it may open the
door to inferring compact feature maps for skin detection. However, contrary
to our expectations, the new features in which hue participates tend to
perform poorly in terms of recall or precision. This result shows
that (i) better features can be constructed without the costly hue, and
(ii) unfortunately a good feature map for skin detection is still elusive.



1 Introduction 

It is well known that pixel based skin detection is an important step in several
vision tasks, such as gesture recognition, hand tracking, video indexing, face
detection, and computer graphics, see e.g. [3,4,9,14,23,25]. Although colours may
vary due to ambient light, brightness, shadows or daylight, skin detection is
computationally tractable. Practitioners have traditionally used existing
off-the-shelf colour models, like HSV, YUV, raw RGB or normalised RGB, which
however often yield poor precision¹ to upper layers in their vision systems.

We know there is no single colour model suitable for skin detection [6], and
that traditional models are not so useful [1,21] for this task. Fortunately, the machine
learning community has developed the concept of attribute construction, which
in this context means to override all existing attributes and infer new ones for
the task at hand. However, which attributes should be used to infer a good model
for skin detection?

We present a computational study where R, G, B and hue participate within
an attribute construction approach. Guided by existing literature, we expect a
clear improvement over features generated with only R, G, and B. Hereafter, we
shall use “attribute” for raw input variables, and “feature” for any combination
of attributes.

¹ Precision = TP/(TP+FP) x 100%, where TP = true positives; and FP = false positives,
where the prediction is incorrectly set as skin.

This report reviews, in section 2, the crossroads of colour detection and
machine learning. Then, in section 3, the main premise and expectations are
presented. Two influential tools from machine learning, attribute construction
and attribute selection, are described in sections 4 and 5, respectively.
Experimental settings and findings appear in section 6, while conclusions to this work
are found in section 7.

2 Improving Colour Detection 

Although eliminating the illumination components has been a popular practice
for skin detection, it has not really improved our vision systems. What is even
worse, it has been reported that this practice actually decreases the separability
in some models [21]. Conversely, adding components in a stepwise
procedure has the same problems. For instance, what you gain with Cr or hue
is lost by adding its second element, Cb and saturation, respectively. In this
direction there is some bad news, see e.g. [1,21], that many practitioners have
sadly noticed: the separability of skin and non-skin points is largely independent
of which existing colour model is employed.

Of course, we may argue that a long term solution would be to look at
different wavelengths. But, for the time being, red-green-blue response prevails
as the standard input value. Thus, if existing general colour models do not
help for target detection, one may be tempted to create synthetic ones.
These may not be reversible, may not be useful for other tasks, and may add nothing new
to Colour Science. Of course, those synthetic models should not be created
by guessing combinations in an unsound way but with a systematic procedure. In
general, any feature space tuned for this purpose will hereafter be cited as a “skin
colour model”.

2.1 Learning and Colour 

Since our raw attributes are not very helpful for skin and non-skin discrimination,
we may transform RGB by a nonlinear mapping. Projecting RGB
into higher-dimensional spaces may result in easier decision boundaries. A line in these new
spaces is in fact a non-linear boundary in the original space. It is known from Statistical
Learning Theory that a hypothetical polynomial of sufficiently high degree can
approximate any arbitrary decision boundary, no matter how complex it is.
Nevertheless, for our purposes two practical questions arise: (1) how do you
create such non-linear features in a constructive manner, and (2) how do you
get only a few of them to describe a good decision boundary?
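
The idea of projecting RGB into a higher-dimensional space can be illustrated with a plain polynomial expansion. This is a generic sketch, not the construction used later in the paper: the function names are hypothetical, and the weights in the example are chosen by hand only to show that a linear rule on the expanded features is a curved boundary in the original space.

```python
from itertools import combinations_with_replacement
from math import prod


def poly_features(pixel, degree=2):
    """All monomials of the components of `pixel` with total degree 1..degree."""
    feats = []
    for d in range(1, degree + 1):
        for idx in combinations_with_replacement(range(len(pixel)), d):
            feats.append(prod(pixel[i] for i in idx))
    return feats


def linear_score(feats, weights, bias):
    """A linear decision function evaluated in the expanded feature space."""
    return sum(w * f for w, f in zip(weights, feats)) + bias


# For (r, g, b) and degree 2 the features are ordered as
# r, g, b, r*r, r*g, r*b, g*g, g*b, b*b.
# These hand-picked weights encode r^2 + g^2 + b^2 - 1, i.e. a sphere.
sphere_weights = [0, 0, 0, 1, 0, 0, 1, 0, 1]
inside = linear_score(poly_features((0.5, 0.5, 0.5)), sphere_weights, -1.0) < 0
```

The expansion grows quickly with the degree, which is exactly why question (2) above, keeping only a few useful features, matters in practice.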

Two recent steps have been taken to bring machine learning approaches to
help colour detection [5,29]. While in both cases colour targets were fed into a
decision tree induction system (c4.5 [19]), the main difference is the attribute
construction step. In [29] the RGB stimulus was directly transformed into hue,
saturation, and average values of R, G, and B. The resulting decision trees use those
attributes, e.g. moment of inertia, to classify colour targets.

Since their work uses pre-existing models, it is more in the sense of
Michalski’s view of pattern discovery in computer vision [15]. However, under this
idea, we assume that the current description language is good enough to describe
our colour targets. A different path was adopted in [5], where no pre-existing
models were assumed. There, as the induction system progresses, new and more
specific features are constructed. In each cycle, current features are passed to
an attribute selection step, and only those which exhibit good performance are
allowed to continue in the process. Although both approaches are a bit costly
in terms of computing time (e.g. attribute selection and tree induction), it is
worthwhile to explore machine learning ideas to automate colour detection.

3 Why Not Hue? 

As mentioned before, the authors of [5] used RGB to create new features. However,
we may ask why they did not include hue in their initial set. Hue has a
long-standing record as a good colour attribute. Moreover, it has been recently assessed
as one of the most influential attributes in a survey [6]. Thus, it seems quite
natural to include hue and pose two questions:

1. Does hue contribute to inferring better colour spaces for skin detection?

2. Is it possible to infer a 2D skin space?

The intuition behind adding hue is quite clear. Hue may be used as a shortcut
to get more compact features. It is an obvious candidate for any induction
system. Many other colour attributes are easily derived as a linear combination
of RGB, whereas hue is considerably more complex to infer. By allowing hue to
participate in the attribute construction, one may expect powerful features with
better recall, precision and success measures. Further, as a by-product, one may
expect to find a good 2D model. Many current approaches to skin
detection use three components (e.g. [5,6,14]), or a 2D model where the illumination
component has simply been removed.

4 Attribute Construction 

Most constructive induction systems use boolean combinations of existing attributes
to create new ones, e.g. [18,20,27,28]. That is, their constructive operators
can form conjunctions and/or disjunctions of attributes (e.g. [11,12,18,20]) or
even use more sophisticated operators such as M-of-N [16] and X-of-N [27].
M-of-N answers whether at least M of the conditions in the set are true. X-of-N
answers how many conditions in this set are true. Although a large number
of studies have been devoted to boolean combinations of attributes (e.g. [28]),
there are very few systems that use arithmetic combinations of real-valued ones,
which normally occur in vision. Most notable is the Bacon system [13] which
^ Recall = X 100%, where TP = true positives. 

Success rate = , where FN = false negatives. 

searches for empirical laws relating dependent and independent variables. Bacon
finds increasing and decreasing monotonic relations between pairs of variables
that take on numeric values and calculates the slope by relating both terms to
create a new attribute. Once a functional relation between variables is found,
it is taken as a new dependent variable. This process continues until a complex
combination is found relating all the primitive attributes.
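
The two boolean operators mentioned above, M-of-N and X-of-N, are simple enough to state directly. The following minimal sketch uses hypothetical function names and takes the conditions as an already evaluated list of booleans.

```python
def x_of_n(conditions):
    """X-of-N: how many of the N conditions are true."""
    return sum(1 for c in conditions if c)


def m_of_n(m, conditions):
    """M-of-N: whether at least M of the N conditions are true."""
    return x_of_n(conditions) >= m


# Three conditions evaluated on some example, two of which hold.
conds = [True, False, True]
```

M-of-N yields a boolean attribute, while X-of-N yields an integer one, which is why the latter can express a family of M-of-N tests with a single threshold.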

In this paper we start with hue and the three basic colour components RGB in
a normalised form, and a simple set of arithmetic operators to produce a suitable
model for pixel based colour detection. Once a new set of attributes is produced,
a restricted covering algorithm (RCA, [5]) is used to construct single rules of no
more than a small number of easy-to-evaluate terms with a minimum accuracy.
We are interested in inducing simple models as they are relevant to applications
which require fast response times, such as semi-automatic calibration of colour
targets, gesture recognition, face and human tracking, etc.

The general approach followed in this paper for constructive induction is
shown in Table 1. The idea is to start with some primitive attributes and a set
of constructive operators, create a new representation space, run an inductive
learning algorithm, and select the best attributes of this new space. This process
continues until a predefined stopping criterion is met.

Table 1. General constructive induction idea: (i) the machine learning algorithm, (ii)
the constructive induction module, and (iii) an evaluation component.

CurrentAttrib = original attributes, i.e. { r/(r+g+b), g/(r+g+b), b/(r+g+b), hue }
Operators = set of constructive operators, i.e. { A+B, A*B, A-B, B-A, A/B, B/A, A² }
UNTIL termination criterion
  • NewAttrib = CurrentAttrib U new attributes constructed with Operators on CurrentAttrib
  • Run a machine learning algorithm on NewAttrib
  • CurrentAttrib = Select the best attributes from NewAttrib

The constructive induction algorithm starts with hue and the normalised components
r/(r+g+b), g/(r+g+b), and b/(r+g+b).
All of them were used to create new attributes by seven constructive
operators: A + B, A * B, A - B, B - A, A/B, B/A, and A², where A and B can
be any pair of distinct attributes.
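
One construction step with the seven operators can be sketched as follows. This is an illustration under assumptions rather than the authors' code: the function names are hypothetical, hue is taken from Python's standard `colorsys` module (scaled to [0, 1)), whereas the paper does not state its hue scale, and divisions by zero are simply skipped.

```python
import colorsys
from itertools import combinations


def primitive_attributes(R, G, B):
    """Normalised r, g, b and hue for an 8-bit RGB pixel (hue scale is an assumption)."""
    total = R + G + B
    if total == 0:
        raise ValueError("black pixel: normalised RGB is undefined")
    hue = colorsys.rgb_to_hsv(R / 255.0, G / 255.0, B / 255.0)[0]
    return {"r": R / total, "g": G / total, "b": B / total, "h": hue}


def construct_attributes(attrs):
    """Apply A+B, A*B, A-B, B-A, A/B, B/A to every pair and A^2 to every attribute."""
    new = {f"{name}^2": value * value for name, value in attrs.items()}
    for (na, a), (nb, b) in combinations(attrs.items(), 2):
        new[f"{na}+{nb}"] = a + b
        new[f"{na}*{nb}"] = a * b
        new[f"{na}-{nb}"] = a - b
        new[f"{nb}-{na}"] = b - a
        if b != 0:
            new[f"{na}/{nb}"] = a / b
        if a != 0:
            new[f"{nb}/{na}"] = b / a
    return new


attrs = primitive_attributes(200, 120, 100)
candidates = construct_attributes(attrs)
```

With four primitive attributes a single step already yields up to 40 candidates (6 pairs times 6 operators plus 4 squares), which is why each cycle is followed by a selection step before constructing further.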



5 Attribute Selection 

While it is relatively easy to create new features, their evaluation is a very
time-consuming step. It is the inner loop in these induction systems: every new
hypothesis or feature has to be assessed for its ability to discriminate the
target classes. There are basically two main approaches in attribute selection:
filters and wrappers. Filters rank variables independently of their later usage.
Conversely, wrappers guide their attribute selection process specifically for the
machine learning algorithm employed. Generally speaking, filters are faster than
wrappers. However, it is commonly assumed that wrappers offer better predictive
performance, i.e. the selected subset is tightly tuned for the machine
learning algorithm and thus for the predictor too. Additional information on
attribute selection can be found in a recent survey [8], and in a special issue on
Variable and Feature Selection [10].

In this paper we adopt a wrapper approach using an information gain heuristic.
The resulting representation is a tree-like structure, generated by RCA,
which is chosen because of two main advantages: (i) it is simple in both representation
and computing requirements, and (ii) it is able to produce a range of models
for the target class. The algorithm is briefly described in the next paragraphs.

5.1 RCA 

The general strategy of RCA is to favour attributes which cover a large number
of true positives and attributes with a small number of false positives. We are
interested in single rules, so we shall talk about the total number of true positives
(TTP), which is used to increase recall, and the total number
of false positives (TFP), which is used to increase precision.

Since we are dealing with real-valued attributes, RCA creates binary splits
using an information gain heuristic (as C4.5 does, [19]). RCA considers two
possible attributes in parallel when constructing rules. In its first cycle, RCA
constructs two rules whose LHS contains the attribute with the largest TTP in one
rule and the attribute with the largest TTP - TFP² in the other rule. The following
cycle produces two rules out of each original rule (4 in total) following the same
criterion, again adding to the LHS of each rule one attribute with large coverage
and one which is heavily penalized by the number of misclassifications. This process
continues until the rules produced have a predetermined number of
terms. The upper bound on the number of rules produced is 2^n, where n is the number of
terms in each rule. RCA builds 2^n rules in parallel, aiming for a large coverage
with small errors on the same example set. Because it handles two objectives,
it produces a range of alternatives, potentially from high recall and poor precision
to high precision and poor recall, with balanced intermediate states in between.
An overall description of RCA is given in Table 2.
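
The binary-split step that RCA applies to each real-valued attribute can be sketched as below. This is a generic information-gain split search, not the RCA implementation: the function names are hypothetical, labels are assumed to be 0 (non-skin) or 1 (skin), and plain information gain is used, whereas C4.5 itself can also apply a gain-ratio correction.

```python
import math


def entropy(labels):
    """Binary entropy (in bits) of a list of 0/1 labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return -sum(q * math.log2(q) for q in (p, 1.0 - p) if q > 0)


def best_binary_split(values, labels):
    """Return (threshold, gain) of the split with the highest information gain.

    Candidate thresholds are midpoints between consecutive distinct values.
    Returns (None, 0.0) when no split improves on the unsplit set.
    """
    base = entropy(labels)
    pairs = sorted(zip(values, labels))
    best_thr, best_gain = None, 0.0
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between equal values
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        remainder = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        gain = base - remainder
        if gain > best_gain:
            best_thr = (pairs[i - 1][0] + pairs[i][0]) / 2.0
            best_gain = gain
    return best_thr, best_gain


thr, gain = best_binary_split([0.1, 0.2, 0.8, 0.9], [0, 0, 1, 1])
```

Given such a split per attribute, the two RCA criteria then only need the counts of true and false positives among the examples that each split covers.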

5.2 Connection to Other Methods 

The idea of improving features on the fly is not new. To the best of our knowledge, it
appeared in the MOLFEA project, an inductive database system, e.g. [11,12].
However, they use a boolean conjunction of predefined attributes to generate new
ones. We share the aim of generating features not in advance but “on demand”,
by detecting the need for a representation change, but we use
arithmetic instead of boolean combinations.

Other works [17,22] used genetic programming as a preprocessing step.
They construct new features from the initial dataset, generating “potentially”
useful features in advance. A genetic algorithm is then used to control the attribute
selection step. Here, a binary chromosome represents attributes, that is, each
0/1 (false/true) slot states whether the corresponding attribute
participates in the wrapper process. This idea does not introduce any bias in the
feature construction in a class-dependent or goal-driven fashion, as previously
mentioned.

Although an open avenue would be to use other (e.g. evolutionary) ideas for
attribute construction and selection, which may suit this context well, we
feel that our proposed technique is rather straightforward (see Table 2) and avoids
unnecessary costly operations.

Table 2. Overall description of RCA. Two intermixed criteria to induce rules with
complementary attributes.

For each class C
  Let E = training examples
  Let N = maximum number of terms per rule
  Create a rule R(0) with empty LHS and class C
  Let depth D = 1
  Until D = N do
    For each attribute A create a split (Sp_A) with greater information gain
    For each existing rule R(D - 1)
      create two new rules (R1(D) and R2(D)) by adding to its LHS
      a Sp_i with larger TTP (R1(D)) and a Sp_j with larger TTP - TFP² (R2(D))
    Let D ← D + 1
    For each Ri(D) continue with its own covered examples from E
  Output all Ri(D)



6 Experimental Settings and Results 

We create ten subsets with skin and non-skin elements. The dataset is described
in [6]; it is based on real skin images from different input sources, illumination
conditions and races, with no photographic manipulation. For each data
set, we selected 33000 skin and 67000 non-skin elements uniformly at random.
We perform 10-fold cross validation in the attribute selection step, the inner
loop. The covering algorithm, RCA, wraps c4.5 for attribute selection, which
ran with the usual pruning and confidence thresholds. In addition, we require a
minimum number (500) of elements per leaf to force fewer leaves. Final results
for each induced model were calculated on large and balanced unseen data: 12.2
million points each for the skin and non-skin targets.
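
For reference, recall and precision can be computed from raw counts on the unseen data as sketched below. The function names are hypothetical, and the recall formula is the standard definition TP/(TP+FN), which is an assumption here because the corresponding footnote is not legible in this copy; the success rate is left out for the same reason.

```python
def recall(tp, fn):
    """Percentage of true skin points that the model labels as skin."""
    return 100.0 * tp / (tp + fn)


def precision(tp, fp):
    """Percentage of points labelled as skin that really are skin."""
    return 100.0 * tp / (tp + fp)


# Toy counts: 80 skin points found, 20 missed, 10 non-skin points mislabelled.
r = recall(tp=80, fn=20)
p = precision(tp=80, fp=10)
```

A model can trade one measure for the other, which is why the tables report both.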

Table 3 shows the best two-dimensional models found by RCA. RCA is designed
in such a way that one feature can appear twice in the same branch.
Hence, it introduces a double threshold. Only two leaves were generated with
this concept. To our surprise, the best precision on both 2D and 3D models is

Table 3. Only two models had two variables, yet one of them has the highest precision
among all experiments.



colour models (two components) 


recall precision success rate 




(%) 


(%) 


(%) 


u 

(r+g + b)'-^ 


80 


92.6 


86.3 


gb h*(r+g-\-b) 

(r+g + 6)2 b 


97 


74.8 


82.7 



Table 4. Best features generated by the attribute selection procedure. Only the last 
row exhibits a well-balanced performance. 



color model (three components) 



recall precision success rate 



(%) (%) (%) 





Yi 


r 


95.6 


88.6 


91.7 


(r+g + 6)2 




b 


gb 


h.+ (r+3+&) 


r 


98.2 


77.9 


85.2 


{r+g + b)'-‘ 


b 


9 


h*{r+g + b) 

b 


r 

9 


h 


98.1 


82.9 


88.9 




gb 


rg 


98 


78.8 


85.3 




(r+g + 6)2 


(r+g + b)^ 




gb 


b — r 


98 


64.1 


72 




(r+g + 6)^ 


(r+g + b) 


Y^ 


gb 


/i + (r+5f + 6) 


98.1 


65.3 


73.4 




(r+g + 6)2 


9 


h 


r 

b 


r — b 

(r+g + b) 


94 


90.5 


92.4 



the first model in Table 3. Normally, a high recall is what is expected, but in this
case hue, together with the other component of the model, contributed to a precision of 92.6%.

The same features as in Table 3 were selected in other branches, as shown in the
first two lines of Table 4. Interestingly, better models were found by removing
these double thresholds and selecting a new variable instead. Thus, the overall
effect is to eliminate one threshold of the doubled feature and to increase the recall. We should
be critical of the second row in Table 4, in which recall is rather high. This is not
a surprise, since even a plain ratio of two components can achieve more than 95% recall,
but at the expense of poor precision, as in this case. An interesting model should
balance recall and precision, which in turn leads to a good success rate.

We should compare Tables 3 and 4 with state of the art models, shown in Table 5.
An attribute construction and selection approach appears in [5], and its finding
is listed in the top row of Table 5. The second row (from [6]) shows a model found by a
stepwise forward selection method; it consists of hue, GY and Wr, which are
defined as:

Table 5. State of the art models: (top) Automatic feature construction from RGB,
(middle) Step-wise forward selection on several colour components, (bottom) Skin
Probability Map, which is extremely fast.

other models (three components)        recall (%)   precision (%)   success rate (%)
r/g, rb/(r+g+b)², rg/(r+g+b)²          93           91.5            92.2
h, GY, Wr                              93.2         92.1            92.6
SPM on raw RGB                         95.8         77.3            91



$$GY = -0.30\,r + 0.41\,g - 0.11\,b$$

$$Wr = \left( \frac{r}{r+g+b} - \frac{1}{3} \right)^2 + \left( \frac{g}{r+g+b} - \frac{1}{3} \right)^2$$



Unfortunately, features like Wr are very difficult to infer with the existing
attribute construction scheme. The third row in Table 5 shows a pragmatic and
well-known approach, the so-called Skin Probability Map (SPM), working on raw RGB
values. The SPM has a threshold variable which is tuned for this comparison.
The learning procedure used all ten subsets, instead of only one. This is a somewhat
unfair comparison, but that is why SPMs work, see [6,14]: skin in RGB
has a very sparse distribution. As is common practice in SPMs, we used the widely
adopted parameter [14] of 32 equally sized histogram bins per channel, i.e. 32³ bins.
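
A Skin Probability Map of the kind described above can be sketched with two histograms over 32 x 32 x 32 RGB bins. This is a minimal illustration under assumptions, not the configuration used in the comparison: the function names are hypothetical, the probability estimate s/(s+n) per bin is one common form, and the default threshold of 0.5 is arbitrary, since the tuned value is not reported.

```python
from collections import Counter

BINS = 32  # equally sized bins per channel, i.e. 32**3 bins in total


def bin_of(pixel):
    """Map an 8-bit (R, G, B) pixel to its histogram bin."""
    return tuple(c * BINS // 256 for c in pixel)


def train_spm(skin_pixels, nonskin_pixels):
    """Count how many skin and non-skin training pixels fall into each bin."""
    skin = Counter(bin_of(p) for p in skin_pixels)
    nonskin = Counter(bin_of(p) for p in nonskin_pixels)
    return skin, nonskin


def skin_probability(pixel, skin, nonskin):
    """Fraction of training pixels in this bin that were skin (0.0 if bin is empty)."""
    b = bin_of(pixel)
    s, n = skin[b], nonskin[b]
    return s / (s + n) if s + n else 0.0


def is_skin(pixel, skin, nonskin, threshold=0.5):
    """Label a pixel as skin when its bin probability reaches the threshold."""
    return skin_probability(pixel, skin, nonskin) >= threshold


skin_hist, nonskin_hist = train_spm(
    [(200, 120, 100)] * 3,
    [(200, 120, 100)] + [(10, 10, 10)] * 5,
)
```

Classification is a single table lookup per pixel, which is what makes the approach extremely fast; the price is that its quality depends on having enough training data to populate the bins.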



6.1 Discussion 

With these results in hand we may come back to the original questions. How
does hue contribute to new colour features? Essentially, there is no real
impact of including hue in new features.

Although some good features have been found, we expected better results,
say, by consistently exceeding a mark of 90% in both recall and precision. Two
existing models, shown in Table 5, do achieve this mark, and one of them does
not use hue at all. Only one model in Table 4 (the last one, top to bottom) exhibits
competitive results. And, at best, it is comparable to the top row of Table 5, which does not
use hue and is thus faster to compute. Moreover, some doubts may arise with the
inclusion of a noisy feature, which nevertheless was selected in two models.

Although promising features appear in Table 3, a good enough 2D skin space
is unfortunately still elusive. The success of this study relies on creating strong
features, which were not produced using hue. In fact, the attribute selection
shows a strong bias to h*(r+g+b)/b and gb/(r+g+b)². Nonetheless, their associated
models show regular to poor performance.

7 Conclusions

We report an attribute construction experience with the aim of finding good
colour features for pixel based skin detection by including hue in the initial
subset. Unfortunately, with our methodology, we found that hue makes only a minor
contribution to novel features. Moreover, only one model is at best comparable
to a model from the existing literature in this field that does not use hue. Thus, it indicates
that (i) better features may be constructed without the costly hue, and (ii)
unfortunately a 2D skin colour model is still elusive.

Skin colour processing is an active field. We encourage other people to verify 
and extend this line by including different features (e.g. texture) in the attribute 
construction scheme or, perhaps, developing novel attribute selection methods 
that overcome the initial bias to terms with hue. 

Abstract. In this paper, the use of morphological contrast mappings
and a method to quantify the contrast for segmenting magnetic resonance 
images (MRI) of the brain was investigated. In particular, contrast trans- 
formations were employed for detecting white matter in a frontal lobule 
of the brain. Since contrast mappings depend on several parameters (size, 
contrast, proximity criterion), a morphological method to quantify the 
contrast was proposed in order to compute the optimal parameter values. 
The contrast quantifying method, that employs the gradient luminance 
concept, enabled us to obtain an output image associated with a good 
visual contrast. Because the contrast mappings introduced in this article 
were defined under partitions generated by the flat zone notion, these 
transformations are connected. Therefore, the degradation of the output 
images by the formation of new contours was avoided. Finally, the ra- 
tio between white and grey matter was calculated and compared with 
manual segmentations. 



1 Introduction 

In mathematical morphology, contrast enhancement is based on morphological 
contrast mappings as described by Serra [17]. The main idea of these 
transformations is to compare each point of the original image with two patterns 
and then select the pattern value nearest to the original image. The 
first works dealing with contrast theory were carried out by Meyer and Serra [12]. 
Indirectly, a special class of contrast mappings called morphological slope 
filters (MSF) was introduced by Terol-Villalobos [20] [21] [22], in which a gradient 
criterion was used as the proximity criterion. Moreover, in Terol-Villalobos [22], 
the flat zone concept on the partition was introduced in the numerical case and 
the morphological slope filters were defined as connected transformations. Once 
the basic flat zone operations were defined on the partition in the numerical 
case, the morphological contrast mappings were proposed as connected 
transformations by Mendiola and Terol [10] [11]. One important difference between 
the contrast mappings proposed by Serra [17] and those proposed by Mendiola 
and Terol [10] [11] is that in the latter the size of the structuring element is 
considered a variable parameter in the primitives as well as in the proximity 
criterion. However, this raises a problem, since appropriate values for these 
parameters must be calculated. This problem is solved in the present paper by 
proposing a morphological quantitative contrast method, which is used to 
determine adequate values for the parameters involved in the morphological 
contrast mappings. 

Numerous models of contrast metric based on the human 
visual system have been proposed; they are mostly concerned with predicting our 
ability to perceive basic geometrical shapes and optical illusions. Few of these 
models work properly, mainly because the human visual system is enormously 
complex. Nevertheless, there have been several attempts that provide reasonably 
successful models; some examples can be found in [2], [3], [6], [15], and [19]. In 
our case, the main purpose of introducing a morphological quantitative contrast 
model is to have a contrast measure that helps to identify the output 
images whose contrast is enhanced from the point of view of 
visual contrast. The morphological quantitative contrast measure introduced in 
this work uses the concept called luminance gradient (see [1]); the 
contrast model will be used to determine optimal parameters associated with 
contrast mappings, which will be applied to detecting white matter located in 
a frontal lobule of the brain. 

On the other hand, several image processing techniques have been employed 
for segmenting MRI of the brain. A complete 
description dealing with this subject is presented in [5], [7], and [14]. At the present 
time, the most widely applied technique is manual segmentation of the brain, 
which has several disadvantages: (i) it generally requires a high level of expertise, 
(ii) it is time and labor consuming, and (iii) it is subjective and therefore not 
reproducible. The advantages of employing an automatic or semi-automatic 
segmentation approach are its reproducibility and readiness, which are useful when 
the specialist in the area needs to measure, diagnose and analyze an image in a 
study. Some works on MRI segmentation of the brain related to mathematical 
morphology can be found in [4], [8], [13], among others. In this paper we propose 
a morphological contrast measure and the application of contrast mappings in 
order to segment white and grey matter in a frontal lobule of the brain. The 
quantification of white and grey matter in a frontal lobule gives experts in 
the area important information concerning memory impairment related to aging. 

This paper is organized as follows. Section 2 briefly presents the basic 
morphological transformations defined on the partition and the morphological contrast 
mappings. In Section 3, a morphological method to quantify the contrast is 
introduced. Finally, in Section 4 an application to brain MRI is presented. In this 
case white and grey matter are segmented in a frontal lobule and their ratio is 
quantified and compared with manual segmentations obtained by experts from 
the Institute of Neurobiology, UNAM Campus Juriquilla. 

2 Morphological Basic Transformations on the Partition 

2.1 Connectivity 

Serra [18] established connectivity by means of the concept of connected class. 

Definition 1 (Connected class). A connected class $\mathcal{C}$ in $\mathcal{P}(E)$ is a subset of 
$\mathcal{P}(E)$ such that: 

(i) $\emptyset \in \mathcal{C}$ 

(ii) $\forall x \in E, \{x\} \in \mathcal{C}$ 

(iii) For each family $\{C_i\}$ in $\mathcal{C}$, $\bigcap C_i \neq \emptyset \Rightarrow \bigcup C_i \in \mathcal{C}$ 

where $\mathcal{P}(E)$ represents the set of all subsets of E. An element of $\mathcal{C}$ is called a 
connected set. An equivalent definition to the connected class notion is the opening 
family expressed by the next theorem [18]. 

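Theorem 1 (Connectivity characterized by openings). The definition of 
a connectivity class $\mathcal{C}$ is equivalent to the definition of a family of openings 
$\{\gamma_x, x \in E\}$ such that: 

(a) $\forall x \in E, \gamma_x(\{x\}) = \{x\}$ 

(b) $\forall x, y \in E$ and $A \subseteq E$, $\gamma_x(A) = \gamma_y(A)$ or $\gamma_x(A) \cap \gamma_y(A) = \emptyset$ 
(c) $\forall A \subseteq E$ and $\forall x \in E$, $x \notin A \Rightarrow \gamma_x(A) = \emptyset$ 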

When the transformation $\gamma_x$ is associated with the usual connectivity 
(arc-wise) in $Z^2$ ($Z$ is the set of integers), the opening $\gamma_x(A)$ can be defined as the 
union of all paths containing x that are included in A. When a space is equipped 
with $\gamma_x$, the connectivity can be expressed using this operator. A set $A \subseteq Z^2$ 
is connected if and only if $\gamma_x(A) = A$. In Fig. 1 the behavior of this opening is 
illustrated. The connected component of the input image X (Fig. 1(a)) to which 
point x belongs is the output of the opening $\gamma_x(X)$, while the other components 
are eliminated. 

Definition 2 (Partition). Given a space E, a function $P : E \to \mathcal{P}(E)$ is called 
a partition of E if: (a) $x \in P(x)$ for all $x \in E$, and (b) $P(x) = P(y)$ or 
$P(x) \cap P(y) = \emptyset$ for all $x, y \in E$. 

P(x) is the element of the partition containing x. If there is a connectivity 
defined in E and, $\forall x$, the component P(x) belongs to this connectivity, then the 
partition is connected. 

Definition 3. The flat zones of a function $f : Z^2 \to Z$ are defined as the 
(largest) connected components of points with the same value of the function. 

The operator $F_x(f)$ will represent the flat zone of a function f at point x. 

Definition 4. An operator is connected if and only if it extends the flat zones 
of the input image. 

Fig. 1. Connected components extraction. (a) Binary image X, (b) The opening $\gamma_x(X)$ 
extracts the connected component in X to which point x belongs. 



Definition 5. Let x be a point of $Z^2$ equipped with $\gamma_x$. Two flat zones $F_x(f)$ 
and $F_y(f)$ in $Z^2$ are adjacent if $F_x(f) \cup F_y(f) = \gamma_x(F_x(f) \cup F_y(f))$. 

Note that (a) $x \in F_x(f)$, and (b) $\forall x, y$, $F_x(f) = F_y(f)$ or $F_x(f) \cap F_y(f) = \emptyset$. 
Therefore, the flat zone notion generates a partition of the image. Thus, 
both concepts, flat zone and partition, were used for the introduction of 
morphological transformations in the grey level case. 

Definition 6. Let x be a point in $Z^2$ equipped with $\gamma_x$. The set of flat zones 
adjacent to $F_x$ is given by $\mathcal{A}_x = \{F_{x'} : x' \in Z^2, F_x \cup F_{x'} = \gamma_x(F_x \cup F_{x'})\}$. 

In Fig. 2 the adjacent flat zone concept is illustrated. Two adjacent flat 
zones are presented (see Fig. 2(b) and 2(c)), and the adjacency expression 
$F_x(f) \cup F_y(f) = \gamma_x(F_x(f) \cup F_y(f))$ is also illustrated in Fig. 2(d). 
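
The flat zone and adjacency notions of Definitions 3 to 6 can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `flat_zones` and `adjacency`, the toy image, and the choice of 4-connectivity are assumptions made only for this example.

```python
from collections import deque

def flat_zones(image):
    """Label the flat zones of a 2D grey-level image given as a list of rows.

    Returns (labels, values): labels[r][c] is the zone index of pixel (r, c)
    and values[k] is the grey level shared by every pixel of zone k.
    4-connectivity is an assumption of this sketch.
    """
    rows, cols = len(image), len(image[0])
    labels = [[-1] * cols for _ in range(rows)]
    values = []
    for r in range(rows):
        for c in range(cols):
            if labels[r][c] != -1:
                continue
            k = len(values)              # start a new flat zone
            values.append(image[r][c])
            labels[r][c] = k
            queue = deque([(r, c)])
            while queue:                 # breadth-first flood fill
                i, j = queue.popleft()
                for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ni, nj = i + di, j + dj
                    if (0 <= ni < rows and 0 <= nj < cols
                            and labels[ni][nj] == -1
                            and image[ni][nj] == image[i][j]):
                        labels[ni][nj] = k
                        queue.append((ni, nj))
    return labels, values

def adjacency(labels):
    """Map each zone index to the set of zone indices adjacent to it."""
    rows, cols = len(labels), len(labels[0])
    adj = {}
    for r in range(rows):
        for c in range(cols):
            a = labels[r][c]
            adj.setdefault(a, set())
            for dr, dc in ((1, 0), (0, 1)):   # look down and right only
                nr, nc = r + dr, c + dc
                if nr < rows and nc < cols and labels[nr][nc] != a:
                    b = labels[nr][nc]
                    adj[a].add(b)
                    adj.setdefault(b, set()).add(a)
    return adj

# Toy image with three flat zones arranged left to right.
image = [
    [5, 5, 7, 9],
    [5, 5, 7, 9],
]
labels, values = flat_zones(image)
adj = adjacency(labels)
```

In this toy image the zone of grey level 5 is adjacent only to the zone of grey level 7, which in turn is adjacent to both of the others.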

In the case of working on the partition, the transformations operate 
on the pair $(f, P_f)$, and the element $(f, P_f)(x)$ is taken as the grey level 
value of the connected component $F_x(f)$. The morphological dilation and erosion 
applied over the flat zones are given by: 

$\delta(f, P_f)(x) = \max\{(f, P_f)(y) : F_y \in \mathcal{A}_x \cup \{F_x\}\}$, (1) 

$\varepsilon(f, P_f)(x) = \min\{(f, P_f)(y) : F_y \in \mathcal{A}_x \cup \{F_x\}\}$, (2) 

The dilation and erosion of size $\mu$ are obtained by iterating $\mu$ times the 
elemental dilation and erosion given in equations (1) and (2): 

$\delta_\mu(f, P_f)(x) = \underbrace{\delta \delta \cdots \delta}_{\mu \text{ times}}(f, P_f)(x)$ (3) 

$\varepsilon_\mu(f, P_f)(x) = \underbrace{\varepsilon \varepsilon \cdots \varepsilon}_{\mu \text{ times}}(f, P_f)(x)$ (4) 

The opening and closing on the partition of size $\mu$ induced by f are: 

$\gamma_\mu(f, P_f)(x) = \delta_\mu(\varepsilon_\mu(f, P_f), P_f)(x)$, (5) 

$\varphi_\mu(f, P_f)(x) = \varepsilon_\mu(\delta_\mu(f, P_f), P_f)(x)$, (6) 

The morphological external gradient of size $\mu$ on the partition is defined as: 

$\mathrm{grade}_\mu(f, P_f)(x) = \delta_\mu(f, P_f)(x) - (f, P_f)(x)$ (7) 

The basic morphological transformations defined in this section enable us to 
present the three-state morphological contrast mappings in the next section. 
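
As an illustration of equations (1) to (7), the following Python sketch applies the operators to a small hand-made graph of flat zones. It is only a sketch under stated assumptions: the partition is represented by a dictionary of grey levels per zone and a dictionary of adjacent zones, it is kept fixed while the operators are iterated, and all function names and numbers are hypothetical.

```python
def dilate(values, adj):
    """Elementary dilation, equation (1): each zone takes the maximum grey
    level among itself and its adjacent zones."""
    return {z: max([values[z]] + [values[n] for n in adj[z]]) for z in values}

def erode(values, adj):
    """Elementary erosion, equation (2): minimum instead of maximum."""
    return {z: min([values[z]] + [values[n] for n in adj[z]]) for z in values}

def iterate(op, values, adj, size):
    """Apply an elementary operator `size` times, equations (3) and (4)."""
    for _ in range(size):
        values = op(values, adj)
    return values

def opening(values, adj, size):
    """Erosion followed by dilation, equation (5)."""
    return iterate(dilate, iterate(erode, values, adj, size), adj, size)

def closing(values, adj, size):
    """Dilation followed by erosion, equation (6)."""
    return iterate(erode, iterate(dilate, values, adj, size), adj, size)

def external_gradient(values, adj, size):
    """Dilation minus the original grey level, equation (7)."""
    dilated = iterate(dilate, values, adj, size)
    return {z: dilated[z] - values[z] for z in values}

# Hypothetical chain of four flat zones: 0 - 1 - 2 - 3.
adj = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
values = {0: 10, 1: 50, 2: 20, 3: 30}

opened = opening(values, adj, 1)
closed = closing(values, adj, 1)
gradient = external_gradient(values, adj, 1)
```

On this toy graph the opening never exceeds the original grey levels and the closing never falls below them, as expected of these operators.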

Fig. 2. Adjacent flat zone concept. (a) Image f with 14 flat zones, (b) Flat zone 
at point x, $F_x(f)$, (c) Flat zone at point y, $F_y(f)$, (d) Two adjacent flat zones, i.e., 
$F_x(f) \cup F_y(f) = \gamma_x(F_x(f) \cup F_y(f))$. 



2.2 Morphological Contrast Mappings 



As stated in the introduction, the contrast mappings consist of the 
selection of some patterns (primitives) for each point of the image in accordance 
with a proximity criterion. The selection of the primitives is very important, 
since the degradation of the output images can be attenuated if the primitives 
are idempotent transformations, as described by Serra (see [17]). On the other 
hand, in Mendiola and Terol (see [10] [11]) two- and three-state morphological 
contrast mappings with size criteria on the partition were proposed. The 
proximity criterion with size criteria basically allows a different performance of the 
morphological contrast mappings, hence providing an alternative way of 
modifying the contrast in an image. In what follows, three-state contrast mappings with size 
criteria on the partition are considered. These contrast mappings are composed 
of three primitives: the opening and closing on the partition and the original image (see 
equations (5) and (6)). The proximity criterion $\rho(x)$ (see equation (8)) considers 
the bright and dark regions of the image. Note in equation (8) that a ratio factor 
is calculated at each point of the image. 






( 8 ) 



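Expression (9) establishes a three-state contrast mapping with size criteria 
on the partition: 

$$
\begin{cases}
\varphi_{\mu_1}(f, P_f)(x) & 0 \le \rho(x) < \beta \\
(f, P_f)(x) & \beta \le \rho(x) < \alpha \\
\gamma_{\mu_2}(f, P_f)(x) & \alpha \le \rho(x) \le 1
\end{cases}
\qquad (9)
$$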

The main advantage of working on the partition is that the flat zones of 
the image will never be broken during their processing, and the generation of 
new contours in the output image will be avoided. This occurs 
because the employed transformations are connected. From equations (8) and 
(9), notice that four parameters must be determined: $\mu_1$, $\mu_2$, $\alpha$ and $\beta$. Parameters 
$\mu_1$ and $\mu_2$ are obtained from equation (10) by means of a graphic method. The 
traditional way of studying the sizes of the structures constituting the image is by means 
of the granulometric study of the image (see [16]). 

The graphic method for determining the structure sizes of the images processed 
in this work consists of calculating, by means of equation (10), the volume of the 
processed image point by point; in this case the closing size is fixed while 
the opening size changes. Expression (10) works similarly to a granulometric density 
and allows the size of the opening or closing on the partition to be obtained when 
one of the indices is fixed while the other index changes. 






(10) 



An estimation of the parameters $\alpha$ and $\beta$ involved in equation (9) is obtained 
by means of the morphological contrast method proposed in Section 3. 



3 Contrast Metric 

Given a two-dimensional luminance distribution across a surface, the luminance 
gradient is defined as (see [1]): 

$$\frac{\Delta L}{\Delta x} = \frac{|L_a - L_b|}{b - a}$$

where $L_a$ and $L_b$ are the luminances at two closely spaced points a and b on the 
surface separated by a distance $\Delta x = b - a$ (the absolute value is necessary to 
eliminate any directional dependence). When $\Delta x \to 0$: 

$$\lim_{\Delta x \to 0} \frac{\Delta L}{\Delta x} = \frac{dL}{dx}$$

The changes in luminance are associated with the contours of the image, 
since they produce changes in the scene. One transformation that enables us 
to work directly with the contours of the image is the morphological external 
gradient (see equation (7)). The next expression is proposed in order to have an 
indirect measure of the variations of the luminance gradient (VLG). 

$$VLG = \sum_{x \in D_f} [\mathrm{maxgrade}(f, P_f)(x) - \mathrm{mingrade}(f, P_f)(x)] \qquad (11)$$

where $\mathrm{maxgrade}(f, P_f)(x) = \max\{\mathrm{grade}_\mu(f, P_f)(y) : F_y(f) \in \mathcal{A}_x(f) \cup \{F_x(f)\}\}$ 
and $\mathrm{mingrade}(f, P_f)(x) = \min\{\mathrm{grade}_\mu(f, P_f)(y) : F_y(f) \in \mathcal{A}_x(f) \cup \{F_x(f)\}\}$; 
the element x belongs to the definition domain denoted by $D_f$. The 
idea of this morphological quantitative contrast method consists of the selection 
of the best parameters associated with some value of VLG obtained from the 
graph VLG vs. parameters. The analysis of the graph will be mainly focused on 
its maxima and minima, since they are associated with substantial changes in 
intensity. The following steps are employed for the selection of the local maximum 
and minimum producing good visual contrast. 

- Step 1. Calculate and draw the graph of VLG values vs. parameters for a set 
of output enhanced images. 

- Step 2. A smooth visual contrast will correspond to the value of VLG 
associated with the global minimum in the graph of VLG values vs. parameters. 

- Step 3. A higher visual contrast will correspond to the value of VLG 
associated with the global maximum in the graph of VLG values vs. parameters. 

The performance of this quantitative morphological contrast measure method 
is illustrated in the next section. 
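
The three steps can be sketched in Python as follows. This is a minimal illustration, not the authors' code: it assumes that the external gradient is already known per flat zone, that the sum over the points of the domain is computed by weighting each zone by its area in pixels, and that the candidate parameter pairs and their VLG values (made-up numbers here) come from a set of enhanced images. The names `vlg` and `select_parameters` are hypothetical.

```python
def vlg(gradient, adj, area):
    """Variation of the luminance gradient for one image.

    gradient[z]: external gradient value of flat zone z
    adj[z]:      set of zones adjacent to z
    area[z]:     number of pixels in zone z (every pixel of a zone
                 contributes the same maxgrade - mingrade term)
    """
    total = 0
    for z in gradient:
        neighbourhood = [gradient[z]] + [gradient[n] for n in adj[z]]
        total += area[z] * (max(neighbourhood) - min(neighbourhood))
    return total

def select_parameters(candidates):
    """candidates: list of (parameters, vlg_value) pairs (step 1).

    Returns the parameters at the global minimum (smooth contrast, step 2)
    and at the global maximum (higher contrast, step 3).
    """
    smooth = min(candidates, key=lambda pair: pair[1])[0]
    high = max(candidates, key=lambda pair: pair[1])[0]
    return smooth, high

# Hypothetical chain of four flat zones with made-up gradients and areas.
adj = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
gradient = {0: 40, 1: 0, 2: 30, 3: 0}
area = {0: 4, 1: 1, 2: 2, 3: 3}
value = vlg(gradient, adj, area)

# Made-up (alpha, beta) candidates with made-up VLG values.
candidates = [((0.1, 0.4), 120), ((0.2, 0.5), 350), ((0.3, 0.6), 90)]
smooth, high = select_parameters(candidates)
```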

4 MRI Segmentation 

In this section an application of contrast mappings (see Section 2.2) and the 
morphological contrast quantifying method (see Section 3) is presented. In particular, 
the detection of white matter in a frontal lobule of the brain was carried out. 
The brain MRI-T1 presented in this paper belongs to the MRI-T1 bank of the 
Institute of Neurobiology, UNAM Campus Juriquilla. The file processed and 
presented in this article comprised 120 slices, of which 17 belonged to a frontal 
lobule. The selection of the different frontal lobule slices was carried out by a 
specialist in the area from the same institute. The segmentation of the skull for 
each brain slice was carried out by means of the transformation proposed in 
[9], so that our interest lies only in the segmentation of white and 
grey matter. The first five slices without skull of a frontal lobule are presented 
in Fig. 3(a). By means of equation (10), the size of the opening was calculated 
just for the first slice of the frontal lobule and applied to the other brain slices. This 
approximation was made to avoid a long and impractical process. The size of the 
closing on the partition was fixed at $\mu = 15$. Note that several values for the 
closing on the partition can be tested; however, a greater size for the closing will 
give adequate segmentations. The graph of the volume of equation (10), when 
the closing size is $\mu = 15$ and the opening size varies within the interval [1,12], 
is presented in Fig. 3(b). This graph shows that $\mu = 7$ is an adequate value for 
the size of the opening, since an important change in the internal structures 
of the image may be appreciated. Once the adequate sizes for the opening and 
closing on the partition were determined, the parameters $\alpha$ and $\beta$ involved 
in the contrast mapping transformation defined in equation (9) must be found. 
The values of $\alpha$ and $\beta$ were obtained by means of the morphological 
contrast method proposed in Section 3. In order to simplify the procedure to find 
adequate values for $\alpha$ and $\beta$, only the first slice of the frontal lobule was 
analyzed, and once such values were found, they were applied to the remaining 
slices. This assumption was adopted as an approximation to detect white 
matter in the different slices without undertaking an impractical analysis. In 
accordance with step 1 of the morphological contrast method, the graph VLG 
vs. parameters must be obtained. In this work 12 values for VLG were generated, 
where parameters $\alpha$ and $\beta$ took their values within the interval [0, 1]. The graph 
VLG vs. parameters is presented in Fig. 3(c). In accordance with step 2 and step 
3 of the proposed morphological contrast method, the maximum of interest can 
be obtained from the graph in Fig. 3(c); its value is 140248, corresponding to the 
parameters $\alpha = 0.117$ and $\beta = 0.47$. Such values are associated with the image 
presenting the best visual contrast. The parameters $\alpha = 0.117$ and $\beta = 0.47$ 
were applied in equation (9) together with the opening size $\mu = 7$ and closing 
size $\mu = 15$ for detecting white matter in the slices of the frontal lobule. The first 
five output images where white matter was detected are presented in Fig. 3(d). 

Fig. 3. MRI Segmentation. (a) First five slices of a frontal lobule; (b) Graph of the 
volume of equation (10), the opening size varies within the interval [1,12]; (c) Graph 
VLG vs. parameters; (d) First five output images in which white matter is detected; 
(e) Separation of the white matter detected from the images in Fig. 3(d); (f) First five 
slices where grey matter is segmented 

In order to segment white and grey matter the next algorithm was applied. 



Algorithm to Segment White and Grey Matter 

(i) Threshold the images in Fig. 3(d) between 90-255 sections. 

(ii) Apply a mask between the original image in Fig. 3(a) and the image 
obtained in step (i). 

(iii) Compute the point-by-point difference between the original image in Fig. 3(a) 
and the image in step (ii). 

(iv) Threshold the images obtained in step (iii) between 70-255 
sections. 

(v) Apply a mask between the images in Fig. 3(a) and the images obtained 
in step (iv). In this step the grey matter is segmented. 
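
Steps (i) to (v) can be sketched in Python on images stored as lists of rows. This is only one possible reading of the algorithm: "between 90-255 sections" is assumed to mean keeping grey levels in the closed interval [90, 255], a mask is assumed to keep the original grey level where the thresholded image is non-zero, and the function names and the 2 x 3 toy images are hypothetical.

```python
def threshold(image, low, high):
    """Binary image: 1 where low <= grey level <= high, else 0."""
    return [[1 if low <= v <= high else 0 for v in row] for row in image]

def apply_mask(image, mask):
    """Keep the grey level of `image` where `mask` is non-zero."""
    return [[v if m else 0 for v, m in zip(row, mask_row)]
            for row, mask_row in zip(image, mask)]

def difference(a, b):
    """Point-by-point difference a - b."""
    return [[x - y for x, y in zip(row_a, row_b)]
            for row_a, row_b in zip(a, b)]

def segment(original, enhanced):
    """Return (white, grey) images following steps (i) to (v)."""
    white_mask = threshold(enhanced, 90, 255)      # step (i)
    white = apply_mask(original, white_mask)       # step (ii)
    remainder = difference(original, white)        # step (iii)
    grey_mask = threshold(remainder, 70, 255)      # step (iv)
    grey = apply_mask(original, grey_mask)         # step (v)
    return white, grey

# Toy data: `original` plays the role of a slice in Fig. 3(a) and
# `enhanced` the role of the contrast-mapped slice in Fig. 3(d).
original = [[200, 120, 30],
            [180, 95, 0]]
enhanced = [[220, 60, 10],
            [150, 80, 0]]
white, grey = segment(original, enhanced)
```

With these toy values the first column ends up in the white-matter image and the second column in the grey-matter image, while the dark third column is discarded.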

In Fig. 3(e) and 3(f) the segmentations of white and grey matter are presented. 
The quantification of white and grey matter was carried out 
slice by slice from the images in Figs. 3(e) and 3(f) by counting the pixels different 
from zero. The volume of white matter amounted to 17614 pixels 
and that of grey matter to 22384 pixels; the ratio between grey and white matter 
was 1.270. The relation between white and grey matter was compared with a 
manual segmentation made by an expert in the area; the comparison gave a 
variation of +5% with respect to the manual segmentation. In this paper the 
segmentation of only one frontal lobule is presented; however, the same procedure 
was applied for segmenting four frontal lobules. In these segmentations, the ratios 
between white and grey matter presented a variation of ±5% with respect to the 
manual segmentations. The segmentation of white and grey matter, as well as 
the ratios between white and grey matter, were validated by an expert of the 
Institute of Neurobiology, UNAM Campus Juriquilla, Queretaro. 
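
The reported ratio can be checked directly from the two pixel counts. The short Python sketch below does so and also shows one possible way of testing a ±5% variation; the helper `within_tolerance` and its reading of the variation as a relative deviation from the manual ratio are assumptions, since the manual ratios are not given in the text.

```python
white_pixels = 17614   # white-matter pixel count reported in the text
grey_pixels = 22384    # grey-matter pixel count reported in the text

# Ratio between grey and white matter, approximately 1.2708.
ratio = grey_pixels / white_pixels

def within_tolerance(automatic, manual, tolerance=0.05):
    """True if the automatic ratio deviates from the manual ratio by at most
    `tolerance`, taken relative to the manual ratio (an assumption)."""
    return abs(automatic - manual) <= tolerance * manual
```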

5 Conclusion 

In the present work, morphological contrast mappings and a proposed morpho- 
logical quantitative contrast method were used to segment MRI of the brain in 
order to obtain a segmentation of white and grey matter located in a frontal lob- 
ule. The morphological contrast method was useful to determine some important 
parameters that define the action interval of the proximity criterion utilized in 
the morphological contrast mappings. The ratio between grey and white matter 
was compared with a manual segmentation carried out in the Institute of Neuro- 
biology, UNAM Campus Juriquilla, obtaining an adequate result. The procedure 
proposed to segment white and grey matter has the disadvantage of not being 
completely automatic. 

Abstract. Gaze detection locates the position (on a monitor) 
where a user is looking. This paper presents a new and practical method 
for detecting the monitor position where the user is looking. In general, 
the user tends to move both head and eyes in order to gaze at a certain 
monitor position. Previous studies use one wide-view camera, which 
can capture the user’s whole face. However, the image resolution is too 
low with such a camera and the fine movements of the user’s eye cannot be 
exactly detected. So, we implement the gaze detection system with a dual 
camera system (a wide-view and a narrow-view camera). In order to locate 
the user’s eye position accurately, the narrow-view camera has 
auto focusing/pan/tilt functions based on the 3D 
facial feature positions detected by the wide-view camera. In addition, we use 
an IR-LED illuminator in order to detect facial features and especially eye 
features. To overcome the problem of specular reflection on glasses, 
we use dual IR-LED illuminators and detect the accurate eye position 
while avoiding the specular reflection from the glasses. In our experiments, 
we implemented the real-time gaze detection system and obtained a gaze 
position accuracy, between the computed positions and the real ones, of 
about 3.44 cm RMS error. 

Keywords: Gaze Detection, Dual Camera, Glasses Specular Reflection, 
Dual IR-LED Illuminators 



1 Introduction 

Gaze detection locates the position where a user is looking. This paper 
presents a new and practical method for detecting the point on the monitor where 
the user is looking. In human-computer interaction, the gaze point on a monitor 
screen is very important information. Gaze detection systems have numerous 
fields of application. They are applicable to man-machine 
interfaces, such as view control in three-dimensional simulation programs. 
Furthermore, they can help the handicapped to use computers and are also 
useful for those whose hands are busy doing other things [18]. Previous studies 
mostly focused on 2D/3D head rotation/translation estimation [1][14], 
facial gaze detection [2-8][15][16][18][23] and eye gaze detection [9-13][17][21][24-29]. 
However, gaze detection that considers head and eye movement simultaneously 
has rarely been researched. The methods of Ohmura and Ballard et al. [4][5] have the 
disadvantages that the depth between the camera plane and the feature points in the 
initial frame must be measured manually and that it takes much time (over 1 minute) 
to compute the gaze direction vector. The methods of Gee et al. [6] and Heinzmann 
et al. [7] only compute a gaze direction vector whose origin is located between the 
eyes in the head coordinate system and do not obtain the gaze position on a monitor. 
In addition, if 3D rotations and translations of the head happen simultaneously, 
they cannot estimate the accurate 3D motion due to the increased complexity 
of the least-square fitting algorithm, which requires much processing time. The method 
of Rikert et al. [8] has the constraint that the distance between the face and the 
monitor must be kept the same for all training and testing procedures, which can 
be cumbersome for the user. In the methods of [10] [12] [13] [15] [16], a pair of glasses 
with marking points is required to detect facial features, which can be 
inconvenient for the user. The methods of [2] [3] [20] perform gaze detection by head 
movements, but are limited in that eye movements must not occur. The 
method of [19] performs gaze detection by head and eye movements, but uses 
only one wide-view camera, which can capture the whole face of the user. However, 
the image resolution is too low with such a camera and the fine movements 
of the user’s eye cannot be exactly detected. So, we implement the gaze detection 
system with a dual camera (a wide-view and a narrow-view camera). In order to 
detect the positions of the user’s eye changed by head movements, the narrow-view 
camera has auto focusing/pan/tilt functions based on the 
3D facial feature positions detected by the wide-view camera. In addition, we use an 
IR-LED illuminator in order to detect facial features and especially eye features. To 
detect the exact eye positions for users with glasses, we use dual IR-LED 
illuminators. 

In Section 2, we explain the method for extracting facial features using 
our gaze detection system. In Section 3, we show the method of estimating the 
3D facial feature positions and capturing the eye image with the narrow-view camera. In 
Section 4, the method for estimating 3D head rotation and translation is shown, 
and the method for detecting the final gaze position is explained in Section 5. In 
Section 6, the performance evaluation is provided, and the conclusion is given in 
Section 7. 

2 Extraction of Facial Features 

In order to detect the gaze position on a monitor, we first locate facial features (both 
eye centers, eye corners, nostrils and lip corners) in an input image. There has 
been much research on detecting faces and facial features. One approach 
is to use facial skin color [22], but its performance may be affected by the 
environmental light, race, etc. To overcome such problems and detect the 
facial features robustly in any environment, we use the method of detecting 
specular reflection on the eyes [19]. It requires a camera system equipped with 
some hardware, as shown in Fig. 1. 

Fig. 1. The Gaze Detecting Camera with Dual IR-LED illuminators. Labelled 
components: (1) dual IR-LED (880nm) for detecting facial features; (2) high pass 
filter (passing over 800nm); (3) wide view camera; (4) micro-controller; (5) auto 
pan and tilting; (6) auto focusing narrow view camera including high pass filter. 
Light paths: (a) visible light, ultraviolet rays and infrared light; (b) visible light 
and ultraviolet rays; (c) infrared light (over 800nm). 

In Fig. 1, the IR-LED(1) is used to make the specular reflections on the eyes [19]. 
The HPF(2) (High Pass Filter) in front of the camera lens passes only infrared 
light (over 700 nm), so the input images are affected only by the IR-LED(1) and 
not by external illumination. So, it is unnecessary to normalize the illumination 
of the input image. We use a normal interlaced wide-view(3) and narrow-view(6) 
CCD sensor and a micro-controller(4) embedded in the camera system, which can 
detect every VD (Vertical Drive, which means the starting signal of the even or odd 
field) from the CCD output signal. From that, we can control the illuminator [19]. In 
general, a normal CCD sensor shows lower sensitivity to IR-LED light 
than to visible light, and the input image is darker with IR-LED light. So, we 
use several IR-LEDs (880nm) as the illuminator for detecting facial features. 
The reason for using 880nm as the illuminator is that the human eye can generally perceive 
visible and near infrared light (below 880nm). So, our illuminators are not 
uncomfortable for the user’s eye. When a user starts our gaze detection system, the 
starting signal is transferred to the micro-controller in the camera via RS-232C. 
Then, the micro-controller turns the IR-LED on during the even field and turns 
it off during the odd field, successively [19]. From that, we can get a difference 
image between the even and the odd image, and the specular points of both eyes 
can be easily detected because their gray level is higher than that of any other region [19]. 
In addition, we use the Red-Eye effect in order to detect a more accurate eye 
position [19] and use the method of changing the Frame Grabber decoder value. The 
output signal of the wide/narrow-view cameras (as shown in Fig. 1) is in NTSC format, 
and we use a 2-channel Frame Grabber to convert the NTSC analog signal into a 
digital image. We have implemented a PCI-interfaced Frame Grabber with 
2 decoders and 1 multimedia bridge. Each decoder can A/D convert (analog to 
digital convert) the NTSC signal from the narrow-view and wide-view cameras. The 
multimedia bridge chip interfaces the 2 decoders with the PCI bus of the computer. In 
general, the NTSC signal has high resolution ($0 \sim 2^{10} - 1$), but the 
A/D conversion of a general decoder has low resolution ($0 \sim 2^8 - 1$). So, the 
input NTSC signal cannot be fully represented with such a decoder and some signal 
range may be cut off. The NTSC signal in the highly saturated range is represented 
as level 255 ($2^8 - 1$) of the image. So, both the specular reflection on the eye and some 
reflection regions on facial skin may be represented at the same image level ($2^8 - 1$). 
In such a case, it is difficult to discriminate the specular reflection from 
the other reflections with image information. However, the NTSC analog signal 
level of each region is different (the analog level of the specular reflection on the eye is 
higher than that of the other reflections; we can observe this by checking 
the NTSC analog level with an oscilloscope). So, if we change the decoder 
brightness setting (making the brightness value lower), then the A/D conversion 
range of the decoder can be shifted upward. In such a case, there may 
be no highly saturated range, and the specular reflection on the eye and the other 
reflections can be discriminated easily. When the specular points on the eye are 
detected, we can restrict the eye region around the detected specular points. Within 
the restricted eye search region, we locate the accurate eye center by the 
circular edge matching method. Because we search the restricted eye region, it does 
not take much time to detect the exact eye center (about 5 ~ 10 ms on a Pentium-II 
550MHz). After locating the eye center, we detect the eye corner by using an eye 
corner shape template and SVM (Support Vector Machine) [19]. We get 2000 
successive image frames (100 frames x 20 persons in various sitting positions) and 
from that, 8000 eye corner samples (4 eye corners x 2000 images) are obtained; 
an additional 1000 images are used for testing. Experimental results show that the 
classification error on the training data is 0.11% (9/8000) and that on the testing 
data is 0.2% (8/4000), and that our algorithm is valid for users with glasses or 
contact lenses. In our experimental results, MLP (Multi-Layered Perceptron) shows 
an error of 1.58% on the training data and 3.1% on the testing data. In addition, 
the classification time of SVM is as small as 13 ms on a Pentium-II 550MHz [21]. 
After locating eye centers and eye corners, the positions of both nostrils and lip 
corners can be detected by anthropometric constraints in a face and SVM, similarly 
to eye corner detection. Experimental results show that the RMS errors between the 
detected feature positions and the actual positions (manually detected positions) 
are 1 pixel (both eye centers), 2 pixels (both eye corners), 4 pixels (both 
nostrils) and 3 pixels (both lip corners) in a 640x480 image [19]. From the 
detected feature positions, we select 7 feature points (left/right eye corners of the left 
eye, left/right eye corners of the right eye, nostril center, left/right lip corners) [19] 
and compute the 3D facial motion based on their 2D movement. 
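
The even/odd field difference described in this section can be illustrated with a small Python sketch. It is not the system's implementation: the interlaced frame is assumed to be a list of rows in which even-indexed rows were captured with the IR-LED on and odd-indexed rows with it off, the function names are hypothetical, and the grey levels are made up.

```python
def field_difference(frame):
    """Difference image between the even field (IR-LED on) and the odd
    field (IR-LED off) of an interlaced frame given as a list of rows.
    Negative differences are clipped to zero."""
    even = frame[0::2]
    odd = frame[1::2]
    return [[max(e - o, 0) for e, o in zip(even_row, odd_row)]
            for even_row, odd_row in zip(even, odd)]

def brightest(diff):
    """(row, column) of the highest value in the difference image,
    taken here as the specular point candidate."""
    best = max((v, r, c) for r, row in enumerate(diff)
               for c, v in enumerate(row))
    return best[1], best[2]

# Made-up 4 x 4 interlaced frame with a specular highlight at row 0, column 2.
frame = [
    [40, 45, 250, 50],   # even field, IR-LED on
    [30, 35, 40, 38],    # odd field, IR-LED off
    [42, 44, 60, 48],    # even field, IR-LED on
    [31, 33, 41, 37],    # odd field, IR-LED off
]
diff = field_difference(frame)
specular = brightest(diff)
```

Because ambient light contributes almost equally to both fields, it largely cancels in the difference and the specular reflection stands out.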

3 Estimating the Initial 3D Facial Feature Positions and 
Capturing Eye Image by Narrow View Camera 

After feature detection, we take 4 steps in order to compute a gaze position on 
a monitor [2] [3] [19]. At the first step, when a user gazes at 5 known positions on 
a monitor, the 3D positions (X, Y, Z) of the initial 7 feature points (selected in 
Section 2) are computed automatically. If we use more calibration 
points, the computation accuracy of the initial feature positions will be somewhat 
increased, but the user’s inconvenience is also increased accordingly. At the 
second and third steps, when the user rotates/translates his head in order to gaze 




K.R. Park and J. Kim 



at one position on a monitor, the new 3D positions of those 7 features can be 
computed from 3D motion estimation. In the 4th step, one facial plane is determined
from the new 3D positions of those 7 features, and the normal vector of
the plane represents the gaze vector due to head movement. Once the changed 3D
positions of the initial 7 feature points have been computed in the 2nd and 3rd steps,
they can be converted into positions in the monitor coordinate system. From that, we
can also convert those feature positions into the camera coordinate system based
on the camera parameters, which are obtained in the 1st step. With this
information, we can pan/tilt the narrow-view camera in order to capture the eye
image. In general, the narrow-view camera has a small viewing angle (a large focal
length of about 30 - 45 mm), with which it can capture a large eye image. Consequently,
if the user moves (especially rotates) his head severely, one of his eyes may disappear
from the camera view, so we track only one visible eye with the auto pan/tilt
narrow-view camera. For panning/tilting, we use 2 stepping motors with 420 pps
(pulses per second). In addition, a narrow-view camera generally has a small DOF
(Depth of Field), and the input image can easily be defocused by the user's Z movement.
The DOF is roughly the Z distance range in which the object can be clearly
captured in the camera image. The DOF becomes larger if the
camera iris is smaller or if the Z distance of the object in front
of the camera is larger. However, in our case, we cannot make the user's Z
distance larger on purpose, because users generally sit 50 ~ 70 cm in front of the
monitor. In addition, making the camera iris smaller reduces the light reaching
the camera CCD sensor, and the input image becomes much darker. So, we use a
narrow-view camera with an iris size of 10 mm, an auto-focusing lens and a focusing
motor (420 pps) in order to capture a clearer (more focused) eye image. The auto
pan/tilt/focusing is controlled by the micro-controller (4) in the camera of Fig. 1. For
focusing the narrow-view eye image, the Z distance between the eye
and the camera is required. In our research, the Z distance is computed in the
2nd and 3rd steps, and we use it as the seed for auto focusing on the
eye image. However, auto focusing with a narrow-view camera is reported to be
difficult due to the small DOF, and exact auto focusing cannot be achieved with
the Z distance alone. So, we devise a simple auto-focusing algorithm which checks the
pixel disparity in an image. That is, the auto pan/tilt of the narrow-view camera is
driven by the eye position detected from the wide-view camera, and
the preliminary auto focusing for the eye image is based on the
Z distance computed in the 3rd step. After that, the captured eye image is transferred
to the PC and our simple focusing algorithm checks the focus quality of the image. If the
quality does not meet our threshold, we send a focus lens control command
to the camera micro-controller (4) in Fig. 1. When a defocused eye
image is captured, it is difficult to determine in which direction the focus lens
should move (forward or backward). For that, we use various heuristics (for example,
image brightness and blind/pyramid lens searching). With this auto-focusing
mechanism, we can obtain a focused eye image from the narrow-view camera. In this
stage, we consider the specular reflection on glasses. The surface of the glasses can
reflect the IR-LED light into the narrow-view camera, and the eye image may be
covered by a large reflection region. In such a case, the eye region cannot be detected
and we cannot compute the subsequent eye gaze position. The surface shape of glasses
depends mainly on the user's vision: the weaker the vision, the flatter
the surface of the glasses may be. With a flatter surface, the reflection region tends to
be larger in the eye image and it is more difficult to detect the eye region. So, we use dual
IR-LED illuminators, as in Fig. 1(1). The dual illuminators turn on alternately.
When a large specular reflection is caused by one illuminator (right or left),
it can be detected from the image. As mentioned before (section 2),
the NTSC analog level of a specular reflection region is higher than that of any other
region, so such regions can be detected by changing the decoder brightness setting. When a
large specular region proves to exist with the changed decoder brightness value,
our gaze detection system automatically switches the illuminator (from left
to right or from right to left), so that the specular reflection may no longer
occur. Once a focused eye image has been captured by these procedures, we use a
trained neural network (multi-layered perceptron) to detect the gaze position due to
eye movement. Then, the combined facial and eye gaze position on the monitor is
calculated as the geometric sum of the facial gaze position and the eye gaze position.
A detailed explanation of the first step, including the camera calibration, can
be found in [2]. Experimental results show that the RMS error between
the real 3D feature positions (measured by a 3D position tracking sensor) and the
estimated ones is 1.15 cm (0.64 cm in the X axis, 0.5 cm in the Y axis, 0.81 cm in the
Z axis) for the data of 20 persons that were used for testing the feature detection performance.
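
The focus check described above can be illustrated with a minimal sketch. The text does not define how the "pixel disparity" is measured, so the measure used here (mean squared difference between neighbouring grey levels) is an assumption, as are the names focus_score and needs_refocus and the threshold value.

```python
def focus_score(image):
    """Mean squared grey-level difference between neighbouring pixels.

    `image` is a list of equal-length rows of grey levels. A sharply focused
    image has strong local contrast and therefore a high score. This measure
    is an assumed stand-in for the paper's unspecified pixel disparity check.
    """
    rows, cols = len(image), len(image[0])
    total, count = 0.0, 0
    for r in range(rows):
        for c in range(cols):
            if c + 1 < cols:  # horizontal neighbour
                total += (image[r][c + 1] - image[r][c]) ** 2
                count += 1
            if r + 1 < rows:  # vertical neighbour
                total += (image[r + 1][c] - image[r][c]) ** 2
                count += 1
    return total / count if count else 0.0


def needs_refocus(image, threshold):
    """True when the focus score falls below a (hypothetical) quality threshold."""
    return focus_score(image) < threshold


# Toy images: a high-contrast pattern and a nearly flat (defocused) one.
sharp = [[0, 255, 0, 255], [255, 0, 255, 0], [0, 255, 0, 255]]
blurred = [[120, 130, 125, 128], [127, 126, 129, 124], [125, 128, 126, 127]]
print(needs_refocus(sharp, 1000.0), needs_refocus(blurred, 1000.0))
```

In such a scheme the PC would only send a focus lens command when needs_refocus reports a low score; the direction of the lens movement still has to come from the heuristics mentioned in the text.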

4 Estimating the 3D Head Rotation and Translation 

This section explains the 2nd step described in section 3. Considering the many
limitations and problems of previous motion estimation research [19], we use the
EKF as the 3D motion estimation algorithm, and the moved 3D positions of the
features are estimated from the 3D motion estimates of the EKF and an affine
transform [2][19]. A detailed account can be found in [1]. The estimation accuracy of
the EKF is compared with a 3D position tracking sensor. Our experimental results
show that the RMS errors are about 1.4 cm in translation and 2.98° in rotation.

5 Detecting the Gaze Position on the Monitor 

5.1 By Head Motion 

This section explains the 3rd and 4th steps described in section 3. The initial
3D positions of the 7 features, computed in monitor coordinates in section 3,
are converted into 3D feature positions in head coordinates. Using these
converted 3D feature positions together with the 3D rotation and translation matrices
estimated by the EKF and the affine transform, we obtain the new 3D feature positions in
head and monitor coordinates when the user gazes at a monitor position [2]. From
these, one facial plane is determined, and the normal vector of the plane gives the
gaze vector. The gaze position on the monitor is the intersection between
the monitor and the gaze vector [2][3][20].
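
The geometry of this step can be sketched as follows. This is a simplified illustration, not the authors' implementation: it assumes a monitor coordinate system in which the screen is the plane Z = 0, it derives the facial plane from only three feature points instead of seven, and all function names and coordinates are hypothetical.

```python
import math


def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def plane_normal(p0, p1, p2):
    """Unit normal of the plane through three non-collinear 3D points."""
    n = cross(sub(p1, p0), sub(p2, p0))
    length = math.sqrt(sum(v * v for v in n))
    if length == 0.0:
        raise ValueError("points are collinear")
    return tuple(v / length for v in n)


def centroid(points):
    k = len(points)
    return tuple(sum(p[i] for p in points) / k for i in range(3))


def gaze_on_monitor(origin, normal):
    """Intersect the line origin + s * normal with the monitor plane Z = 0.

    Returns the (X, Y) position on the monitor, or None when the line is
    parallel to the screen. The sign of the normal does not matter here
    because a full line, not a half-ray, is intersected.
    """
    if abs(normal[2]) < 1e-12:
        return None
    s = -origin[2] / normal[2]
    return (origin[0] + s * normal[0], origin[1] + s * normal[1])


# Hypothetical feature positions (cm): two eye corners and a lip corner,
# with the face parallel to the screen at Z = 60 cm.
face = [(-3.0, 3.0, 60.0), (3.0, 3.0, 60.0), (0.0, -4.0, 60.0)]
hit = gaze_on_monitor(centroid(face), plane_normal(*face))
print(hit)
```

With all seven features, the plane would normally be fitted in a least-squares sense rather than taken from three points.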

Fig. 2. The neural network for detecting gaze position by eye movements



5.2 By Eye Motion 

In section 5.1, the gaze position is determined only by head movement. As
mentioned before, when a user gazes at a monitor position, the head and eyes
tend to move simultaneously. So, we compute the eye movements from
the detected iris center and eye corner points, as shown in Fig. 2. Here, we use
circular edge detection and the eye corner template with an SVM to detect
the iris and the corners. This method is almost the same as that used for detecting the
eye position in the wide-view camera, mentioned in section 2. As mentioned before, when
a user rotates his head severely, one of his eyes may disappear from the narrow-view camera.
So, we detect both eyes when the user gazes at the monitor center, and when
the user rotates his head severely, we track only one eye. In general, the eye
position and shape change according to the user's gaze position. In particular, the distance
between the iris center and the left or right eye corner changes with the user's
gaze position. We use a neural network (multi-layered perceptron) to learn the
relation between the eye movements and the gaze positions, as shown in Fig. 2.

Here, the input values for the neural network are normalized by the distance
between the eye center and the eye corner, which is obtained while the user gazes at the
monitor center. This is necessary because we use only an auto-focusing lens for the
narrow-view camera, without a zoom lens. Without a zoom lens, the eye size and the distance
between the iris center and the eye corner in the image plane (in pixels) are affected by
the user's Z position. That is, the closer the user is to the monitor, the
larger the eye size and the distance between the iris center and the eye corner become.
In our research, the user's Z distance varies between 50 ~ 70 cm. So, the
distance normalization for the neural network is required, as in Fig. 2.
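
A minimal sketch of this normalization is given below. The exact inputs of the network in Fig. 2 are not listed in the text, so the two distances used here, the function names and the pixel coordinates are assumptions chosen only to show why dividing by a reference distance removes the dependence on the user's Z position.

```python
import math


def distance(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])


def normalized_eye_inputs(iris, left_corner, right_corner, ref_distance):
    """Iris-to-corner distances divided by a reference distance.

    `ref_distance` is the eye-center-to-corner distance measured while the
    user gazes at the monitor center (hypothetical calibration value).
    """
    if ref_distance <= 0:
        raise ValueError("reference distance must be positive")
    return (distance(iris, left_corner) / ref_distance,
            distance(iris, right_corner) / ref_distance)


# The same eye pose seen at two Z distances: the far image is the near
# image scaled by 0.5, so the normalized inputs should coincide.
ref_near = distance((50.0, 40.0), (20.0, 40.0))
near = normalized_eye_inputs((56.0, 40.0), (20.0, 40.0), (80.0, 40.0), ref_near)
ref_far = distance((25.0, 20.0), (10.0, 20.0))
far = normalized_eye_inputs((28.0, 20.0), (10.0, 20.0), (40.0, 20.0), ref_far)
print(near, far)
```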

5.3 Detecting Facial and Eye Gaze Position

After detecting the eye gaze position with the neural network, we locate the
final gaze position on the monitor, due to both head and eye movements, by the
vector summation of the two gaze positions (face and eye gaze) [19].

6 Performance Evaluations 

The gaze detection error of the proposed method is compared with that of our previous
methods [2][3][18][19] in Table 1. The methods of [2][3] use the 4 steps mentioned in
section 3, but compute the facial gaze position without considering eye movements.
The method of [18] does not compute 3D facial feature positions/motions and
calculates the gaze position by mapping the 2D feature positions onto the monitor
gaze position by linear interpolation or a neural network. In addition, it
considers only head movements, without eye movements. The method of [19] computes
the gaze position considering both head and eye movements, but uses only one
wide-view camera. The test data were acquired while 10 users gazed at 23 gaze
positions on a 19” monitor. Here, the gaze error is the RMS error between the
actual gaze positions and the computed ones. We use the RMS
error as the criterion for gaze detection accuracy because the gaze
position on the monitor is calculated as an X and Y position, and the accuracy in both
axes should be considered at the same time. As shown in Table 1, the gaze errors are
calculated in two cases. Case I is the gaze error for test data including
only head movements. However, head and eye movements often happen
simultaneously when a user gazes at a monitor position. So, we also tested
the gaze error including head and eye movements, as case II.
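
The RMS criterion described above, which combines the X and Y errors in one figure, can be written as a short sketch. The function name and the sample coordinates are hypothetical; only the definition of the RMS error over both axes comes from the text.

```python
import math


def rms_gaze_error(actual, computed):
    """RMS distance between actual and computed gaze positions.

    Both arguments are equal-length lists of (X, Y) monitor positions in the
    same unit (for example cm). The X and Y errors are combined per point.
    """
    if not actual or len(actual) != len(computed):
        raise ValueError("need two non-empty lists of equal length")
    squared = [(ax - cx) ** 2 + (ay - cy) ** 2
               for (ax, ay), (cx, cy) in zip(actual, computed)]
    return math.sqrt(sum(squared) / len(squared))


# Two hypothetical test points: one is off by (3, 4), the other is exact.
error = rms_gaze_error([(0.0, 0.0), (10.0, 0.0)], [(3.0, 4.0), (10.0, 0.0)])
print(error)
```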

As shown in Table 1, the gaze error of the proposed method is the smallest in
both cases. In the second experiment, points of radius 5 pixels are spaced vertically
and horizontally at 150-pixel intervals (2.8 cm) on a 19” monitor with 1280x1024
pixels. The test conditions are almost the same as in Rikert's research [8][19]. The
RMS error between the real and calculated gaze positions is 3.43 cm, which is
superior to Rikert's method (almost 5.08 cm). This gaze error corresponds
to an angular error of 2.41 degrees on the X axis and 2.52 degrees on the Y axis. In
addition, Rikert's method has the constraint that the user's Z distance must always be
the same, but ours does not. For verification, we tested the gaze errors at
Z distances of 55, 60 and 65 cm. The RMS errors are 3.38 cm at 55 cm,
3.45 cm at 60 cm and 3.49 cm at 65 cm. This shows that our method permits the user
to move in the Z direction. Rikert's method also takes a long processing
time (1 minute on an AlphaStation 333 MHz), compared to our method (about 700 ms
on a Pentium-II 550 MHz). Our system only requires the user to gaze at 5 known
monitor positions in the initial calibration stage (as described in section 3);
after that, it can track and compute the user's gaze position without any user
intervention at real-time speed.

Table 1. Gaze error about test data including only head movements (cm)



7 Conclusions 

This paper describes a new gaze detection method. The gaze error is about 3.44
cm and the processing time is about 700 ms on a Pentium-II 550 MHz, whereas
Rikert's method shows an RMS error of almost 5.08 cm and takes a processing time
of 1 minute on an AlphaStation 333 MHz. Our gaze detection error can be
compensated by additional head movement (like mouse dragging). In future work,
we plan to investigate a more accurate method of detecting the eye image, which
will increase the accuracy of the final gaze detection. In addition, methods to
increase the auto panning/tilting/focusing speed of the narrow-view camera should
be researched in order to decrease the total processing time.

Abstract. A novel methodology for characterising histological images
taken from the microscopic analysis of cervix biopsies is outlined. First, the
fundamentals of the malignancy process are reviewed in order to understand which
parameters are significant. Then, the analysis methodology using equalisation
and artificial neural networks is described and the step-by-step analysis output
images are shown. Finally, the results of the proposed analysis applied to
example images are discussed.



1 Introduction 

Cervical Uterine Cancer (CUC) is the most common type of cancer in women of
reproductive age in Mexico, where around 4,300 deaths were recorded in 2001
alone [1], and it represents a serious public health problem worldwide. Enormous
effort has been dedicated to designing adequate diagnosis techniques in order to
detect CUC in its early stage, and there are massive campaigns to apply diagnosis
tests. The challenge is not only to have a reliable testing technology, but also a simple
and inexpensive one that can be used on a massive scale. Accordingly, the aim of this
work is to develop a practical, low-cost tool that allows measuring the
nucleus/cytoplasm ratio (N/C) along the epithelium layer, to help distinguish normal
tissue from abnormal tissue. First, the fundamental medical concepts are reviewed to
provide a clear idea of the parameters involved in pathological image analysis.
Then, the method developed is described in detail and, finally, some results on
real cases are explained.



2 Medical Background 

2.1 Epithelium Structure 

Different layers, known as basal, parabasal, intermediate and superficial, are typical of
a healthy cervix epithelium. The cervix is the lower part of the uterus and is often
called the neck of the uterus. The epithelial cells are produced in the basal layer and
they move through to the superficial layer in about 15 days. For this reason, when a
biopsy (small sample of cells) is analysed, it shows a view of the evolution of the
epithelium cells over time. As the cells mature, the cell nucleus gets smaller and the
amount of cytoplasm increases. The parabasal, intermediate and superficial layers are
the areas of the images on which the mathematical analysis is focused. These
structures are shown in Figure 1.

Fig. 1. Schematic diagram of the cervix epithelium layers.



2.2 Cervical Uterine Cancer 

Although cells in different parts of the body may look and work differently, most of
them repair and reproduce themselves in the same way. Normally, this
division of cells takes place in an orderly and controlled manner. If, for some reason, 
the process gets out of control, the cells will continue to divide, developing into a 
lump that is called a tumour. Tumours can be either benign or malignant. A malignant 
tumour is characterised by uncontrolled growth, alterations of varying extent in the 
structural and functional differentiation of its component cells, and the capacity to 
spread beyond the limits of the original tissue. 

CUC can take many years to develop. Before it does, early changes occur in
the cells of the cervix. The name given to these abnormal cells, which are not
cancerous but may lead to cancer, is Cervical Intra-epithelial Neoplasia (CIN). This is
not precisely a cancer, but it can frequently develop into cancer over a number
of years if it is left untreated. Some doctors call these changes pre-cancerous,
meaning that the cells have the potential to develop into cancer. Thus, CIN
occurs only when the cells lose their normal appearance. When the abnormal cells are
examined under the microscope, they may be divided into three categories, according to
the thickness of the cervix epithelium affected, namely:

• CIN 1 - only one third is affected; this is called mild dysplasia.

• CIN 2 - two thirds are affected; this is called moderate dysplasia.

• CIN 3 - the full thickness of the cervix epithelium is affected; this is referred to as
severe dysplasia (frank cancer that has not invaded the surrounding tissues).

CIN 3 is also known as carcinoma-in-situ. Although this may sound like
cancer, CIN 3 is not strictly a cervix cancer, but it must be treated as soon as possible.
The progression of CIN from one stage to the next takes years and, in some cases,
CIN 1 may even go back to normal tissue. However, as they are part of a progressive
disease, all cases of abnormal smears should be investigated and cases of CIN 2 and
CIN 3 must be treated [1-2]. Schematic samples of different epithelium alterations, such
as moderate dysplasia, carcinoma-in-situ and HPV infection, compared with a normal
epithelium, are shown in Figure 2.

Fig. 2. Normal epithelium (a), moderate dysplasia (b), carcinoma-in-situ (c) and HPV infection (d).



3 Analysis Technique 

The proposed approach is based on the classification of the cellular structures
obtained from biopsy microscopy images, followed by their digital analysis over defined
areas. An efficient neural network approach is selected and used to classify benign
and malignant structures, based on the extracted morphological features. This
technique consists of the identification of pre-malignant conditions, which may
progress to malignancy. Friendly and easy-to-use software was developed to help the
pathologist in the diagnosis of cervix cancer. The software input
consists of microscopy images taken from a cervix biopsy stained by the standard
procedure. The software performs a quantitative analysis of the nucleus/cytoplasm
ratio and a structural analysis of the cellular tissue in its different layers.



3.1 Neural Networks 

The first problem in finding the biopsy image structures is to classify the pixels
according to their colour characteristics. The classification problem requires labelling
each pixel as belonging to one of “n” classes (nucleus, epithelial cytoplasm,
sub-epithelial cytoplasm and white zones).

Artificial neural networks can separate the classes by a learning process that
gradually adjusts the parameter set of a discriminant function; this is the heart of the
image analysis process.

When a plane can separate two classes, the classes are said to be linearly separable
and a neural network without hidden units or layers can learn such a problem. This
property can be applied to our classification problem because the stain used in the
biopsy colours the epithelium structures substantially differently.

For multinomial classification problems, a neural network with n outputs, one for
each class, and target values of 1 for the correct class and 0 otherwise, is used. The
correct generalisation of the logistic sigmoid to the multinomial case is the Softmax
activation function:

    y_i(x) = \frac{e^{x_i}}{\sum_{j=1}^{C} e^{x_j}}, \qquad i = 1, 2, \ldots, C    (1)

where y_i(x) is the activation of the i-th output node, x_i is the net (weighted) input
of that node and C is the number of classes. Notice that y_i(x) is always a number
between 0 and 1.

The error function is defined as:

    E = -\sum_{j=1}^{C} t_j \ln(y_j)    (2)

Equation 2 is the so-called Cross-Entropy error, where t_j is the target and y_j is the
output “j”.

    \frac{\partial E}{\partial W} = -(t - y)\, x    (3)

    W(new) = W(old) + \mu (t - y)\, x    (4)

Equation 3 represents the rate of change of the error when the weights are altered; in
Equations 3 and 4, x denotes the input applied to the weights. Equation 4 moves the
weights against this gradient and gives the new weights W(new) in terms of the old
weights W(old), where \mu is the learning rate, between 0 and 1. Since all the nodes in a
Softmax output layer interact, the output value of each node depends on the values of
all the others.



Preconditioning Network Criteria. There are two main factors to consider during
the learning process of the neural network:

• Learning rate. If μ (the learning rate) is too low, convergence will be very slow; if it
is set too high, the network will diverge. Ill conditioning in neural networks can be
caused by the training data, the network's architecture and the initial weights. Ill
conditioning can be avoided by using preconditioning techniques.

• Input and target normalization. To normalize a variable, first subtract its average
and then divide by its standard deviation.

Before training, the network weights are initialised to small random values. The
random values are usually chosen from a uniform distribution over the range [-r, r]. This
type of learning is referred to as “supervised learning” (or learning with a teacher) because
the target values are taken from known image structures. In this type of supervised
training, both the inputs “x” and the target outputs “t” are provided. The network then
processes the inputs and compares its resulting outputs against the desired outputs. The
error function is then calculated by the system, causing the system to adjust the weights
which control the network. Sets of pixel values are taken from a known image
structure (reference image). The pixel values are used as the inputs or decision
values, and the structures in the image are established as the output classes.
There will be a range of decision values that map to the same class. If the
values of the decision variables are plotted, different regions or point clusters will
correspond to different classes [3-5].

Fig. 3. Neural network structure used by the proposed method.

A single-layer network was selected as the network topology and it is trained by the
so-called Perceptron algorithm. The selected topology is shown in Figure 3.

The complete neural network specifications are: one layer, 5 inputs, 4 nodes, Softmax
activation function, cross-entropy error function and perceptron learning algorithm.
The five inputs consist of the pixel's RGB characteristics and a
constant input “k”.




Perceptron algorithm. The Perceptron algorithm is a step-wise method for
finding the set of weights that classifies the image pixels appropriately.
The steps are the following:

1. Initialise the weights with small random values

2. Pick a learning rate μ as a number between 0 and 1

3. Compute the output activation for each training pattern by Equation 1

4. Compute the error function by Equation 2

5. Update the weights W by Equation 4 until the stopping condition is satisfied (a
specific error function value)

It is important that step four considers the whole set of pixels from all structures or
classes and provides them to the algorithm in random order, to assure appropriate
convergence of the algorithm. The output y_i(x) is interpreted as the probability that
“i” is the correct class. This means that:

• The output of each node must be between 0 and 1.

• The sum of the outputs over all nodes must be equal to 1.

In other words, the y_i(x) values indicate the probability that a particular pixel belongs
to the nucleus, epithelial cytoplasm, sub-epithelial cytoplasm or white-zone
structures. Once the neural network is trained, it has the ability to predict the
output for an input that has not been seen before; this is called “generalization” [6-11].
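
The training procedure of this section can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' Delphi code: the sample vectors are toy values rather than real pixel data, only four inputs (three colour values plus a constant 1) are used because the text does not spell out the fifth input, a fixed number of epochs replaces the error-based stopping condition, and all function names are hypothetical.

```python
import math
import random


def softmax(net):
    """Equation 1, computed in a numerically stable way."""
    m = max(net)
    exps = [math.exp(v - m) for v in net]
    total = sum(exps)
    return [v / total for v in exps]


def forward(W, x):
    """Single-layer network: one weighted sum per class, then Softmax."""
    return softmax([sum(w * xi for w, xi in zip(row, x)) for row in W])


def cross_entropy(t, y):
    """Equation 2 for one sample (outputs clipped to avoid log(0))."""
    return -sum(tj * math.log(max(yj, 1e-12)) for tj, yj in zip(t, y))


def mean_error(W, samples):
    return sum(cross_entropy(t, forward(W, x)) for x, t in samples) / len(samples)


def train(samples, n_out, mu=0.5, epochs=300, seed=0):
    rng = random.Random(seed)
    n_in = len(samples[0][0])
    # Step 1: small random initial weights.
    W = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    order = list(samples)
    for _ in range(epochs):
        rng.shuffle(order)  # present the samples in random order
        for x, t in order:
            y = forward(W, x)
            for i in range(n_out):
                for j in range(n_in):
                    # Equation 4: W(new) = W(old) + mu * (t - y) * x
                    W[i][j] += mu * (t[i] - y[i]) * x[j]
    return W


# Toy, well-separated "colour" vectors with a constant input of 1.
samples = [
    ([1.0, 0.0, 0.0, 1.0], [1, 0, 0, 0]),  # stands in for nucleus
    ([0.0, 1.0, 0.0, 1.0], [0, 1, 0, 0]),  # epithelial cytoplasm
    ([0.0, 0.0, 1.0, 1.0], [0, 0, 1, 0]),  # sub-epithelial cytoplasm
    ([1.0, 1.0, 1.0, 1.0], [0, 0, 0, 1]),  # white zone
]
W = train(samples, n_out=4)
predicted = []
for x, _ in samples:
    y = forward(W, x)
    predicted.append(y.index(max(y)))
print(predicted, mean_error(W, samples))
```

For real images, each sample would hold the normalized colour values of one reference pixel, and the class with the largest output would be assigned to every pixel of the image under analysis.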

4 Practical Image Analysis Procedure

Two operation modes are considered: 



Learning mode. The software learns the reference values that will be used by the
neural network in the image analysis. This is done only once. A reference image is
selected and analysed in order to get the basic parameters used later in the image
processing.

Two parameter sets are considered in this stage:

• Colour deviations are usually produced by differences in the staining procedure and
by differences in the slide illumination during image acquisition. An
equalisation process helps to reduce the colour deviation between images. One
image is selected as a reference and its RGB colour histograms are taken separately.
The digital value of the red pixels of the image, for example, goes from 30 to
240, green goes from 25 to 218 and blue from 10 to 250. These values are taken as
reference parameters (RL, RH, GL, GH, BL and BH) and are used to modify the
respective levels of further images.

• Small sample images are taken from the reference image. Samples of pixels from
the nucleus, cytoplasm and white zones are normalized and used as the input and
target (x, t) arrays when the neural network is trained. Thus, the final weights
are obtained by applying the perceptron algorithm.



Normal mode. When the pathologist selects an image, the image processing is
started. Figure 4 shows a typical example of a normal biopsy. The images must have
all of the structures shown in Figure 1, preferably in the same arrangement.

a).- The first step is the equalisation process using the parameters obtained from
the learning stage. This is done by a linear conversion of each pixel and each of its RGB
colour components. By applying this conversion, a new image is built. Three equations
of the form of Equation 5, one per colour component, are used for the equalisation
process:

    PR_{New} = (PR_{Old} - HRL) \, \frac{RH - RL}{HRH - HRL} + RL    (5)

where PR_New is the new value of the red component of each pixel and PR_Old is the old
value of the red component. RH and RL are the highest and lowest values of the
red histogram taken from the reference image. HRH and HRL are the highest and
lowest values of the red histogram taken from the image to be processed. Similar
equations are used for the green and blue components of the pixel transformation. The
equalisation process produces a change in the histogram, which is represented in
Figure 5. The new transformed image is shown in Figure 6.

Fig. 4. A biopsy image taken from the microscope digital camera.

Fig. 5. Histogram transformation by the equalisation process.

Fig. 6. Image after the equalisation process.

b).- Using the neural network weights obtained in the learning mode, the
program builds a new image where each pixel is classified into one of four categories:
nucleus, epithelial cytoplasm, sub-epithelial cytoplasm and white zones or external
zone. Four different grey levels are assigned to the zones as the new image is built.
The structures classified from the image in Figure 6 are shown in Figure 7.
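
The linear conversion of step a) can be sketched as follows, assuming that it maps the histogram range of the processed image onto the range of the reference image. The function names are hypothetical; the reference ranges are the example values quoted in the learning mode description, while the ranges of the processed image are invented for illustration.

```python
def equalise_component(value, ref_low, ref_high, img_low, img_high):
    """Map one colour component from [img_low, img_high] to [ref_low, ref_high]."""
    if img_high == img_low:  # flat histogram: nothing to stretch
        return ref_low
    new = (value - img_low) * (ref_high - ref_low) / (img_high - img_low) + ref_low
    return max(0, min(255, round(new)))  # keep the result a valid 8-bit level


def equalise_pixel(pixel, ref_ranges, img_ranges):
    """Apply the conversion to the R, G and B components of one pixel."""
    return tuple(
        equalise_component(v, rl, rh, il, ih)
        for v, (rl, rh), (il, ih) in zip(pixel, ref_ranges, img_ranges)
    )


reference = [(30, 240), (25, 218), (10, 250)]  # (RL, RH), (GL, GH), (BL, BH) from the text
current = [(50, 200), (40, 180), (20, 220)]    # hypothetical ranges of a new image
print(equalise_pixel((50, 180, 120), reference, current))
```

A pixel at the lower end of the processed histogram is sent to the lower reference value, and one at the upper end to the upper reference value, which is what aligns the colour ranges of differently stained slides.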

Fig. 7. Transformed image classified into four different grey levels.




Fig. 8. Schematic diagram that shows how the analysis window moves over the image in order 
to find out the epithelium limits a) and how the epithelial layer limits are found b). 

c).- The epithelium zone is then established using a moving rectangular window,
which helps to find where the epithelium begins and ends. The window is first moved
vertically and then along the horizontal path, as shown in Figure 8. The central window
point is evaluated in order to find the epithelium limits. Within the rectangular window,
the areas (numbers of pixels) of the nucleus (N), epithelial cytoplasm (C1),
sub-epithelial cytoplasm (C2) and white zone (W) structures are
computed.

If the sign of [C1 - C2] changes when the window moves vertically, and both C1 and
C2 are non-zero, then the beginning of the epithelium edge is found and drawn over the
image. If the sign of [W - (N + C1 + C2)] changes, then the external limit is also found.
A view of one screen output of the software, showing the epithelial layer limits, can be
seen in Figure 9.
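
The two sign tests of step c) can be sketched as below. The sketch assumes that the pixel counts of each window position along one vertical path are already available as dictionaries with the keys N, C1, C2 and W; this data layout, the function names and the sample counts are hypothetical.

```python
def sign(v):
    return (v > 0) - (v < 0)


def find_epithelium_limits(windows):
    """Return (inner, outer) indices of the window positions where the limits occur.

    inner: first position where the sign of C1 - C2 changes while both
           C1 and C2 are non-zero (beginning of the epithelium).
    outer: first position where the sign of W - (N + C1 + C2) changes
           (external limit). Either value is None when no change is found.
    """
    inner = outer = None
    for k in range(1, len(windows)):
        prev, cur = windows[k - 1], windows[k]
        if inner is None and cur["C1"] > 0 and cur["C2"] > 0:
            if sign(prev["C1"] - prev["C2"]) * sign(cur["C1"] - cur["C2"]) < 0:
                inner = k
        if outer is None:
            before = prev["W"] - (prev["N"] + prev["C1"] + prev["C2"])
            after = cur["W"] - (cur["N"] + cur["C1"] + cur["C2"])
            if sign(before) * sign(after) < 0:
                outer = k
    return inner, outer


# Toy counts for a window moving from sub-epithelial tissue to the outside.
windows = [
    {"N": 5, "C1": 0, "C2": 95, "W": 0},
    {"N": 8, "C1": 30, "C2": 62, "W": 0},
    {"N": 10, "C1": 70, "C2": 20, "W": 0},
    {"N": 6, "C1": 90, "C2": 0, "W": 4},
    {"N": 2, "C1": 30, "C2": 0, "W": 68},
]
print(find_epithelium_limits(windows))
```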

d).- The nucleus/cytoplasm ratio (N/C1) and the white halos/cytoplasm ratio (W/C1)
are evaluated only in the epithelial layer and plotted. This is done by selecting an area
of the image by means of a rectangular window. The N/C1 ratio of a normal epithelium
has an exponential behaviour, which is also plotted and used as a reference. The areas
where N/C1 behaves abnormally are highlighted in order to provide a warning
signal to the pathologist. An example of this output is shown in Figure 10.

Fig. 9. A view of one software output screen showing a typical epithelial analysis.

Fig. 10. Normal and measured nucleus/cytoplasm ratio along the epithelium thirds.

The computer program was developed in the Delphi language and runs on the Windows
platform. It was conceived as an easy tool for pathologists. The system consists of a
digital camera coupled to a microscope and a personal computer. The software allows
loading images from files and saving or printing the analysis results. The user interface
provides a selection window on top of the image that gives the numerical or graphical
nucleus/cytoplasm ratio for any selected area. The microscope magnification
should preferably be fixed at 10X, in order to cover a large epithelium
area. The digital image resolution should be such that the average nucleus diameter
is around 10 pixels, and the image must be saved as a bitmap file.

When the neural network is trained, one hundred training iterations are
enough to reach an error magnitude of 10'".

5 Conclusions

Around 30 different images were tested with satisfactory results and the effectiveness 
of the image analysis proposed was demonstrated. It is mandatory that the images 
have the complete epithelium basic structures in order to assure reliable results. 

The results indicate that the use of intelligent computational techniques along with 
image densitometry can provide useful information for the pathologists. It can provide 
quantitative information that may support the diagnostic reports. 

Although the developed software is easy to use and provides valuable information
about the histological images, it is at the laboratory prototype stage. Novel algorithms
have been developed, such as a nucleus size measurement, and the basal line is analysed
in order to find out whether malignant cells infiltrate it.

Details of the software are available from the authors, upon request.

Abstract. Three-dimensional (3D) registration is the process of aligning range
data sets from different views in a common coordinate system. In order to
generate a complete 3D model, we need to refine the data sets after coarse
registration. One of the most popular refinement techniques is the iterative closest
point (ICP) algorithm, which starts with pre-estimated overlapping regions.
This paper presents an improved ICP algorithm that can automatically register
multiple 3D data sets from unknown viewpoints. The sensor projection, which
represents the mapping of the 3D data into its associated range image, and a
cross projection are used to determine the overlapping region of two range data
sets. By combining the ICP algorithm with the sensor projection, we can
automatically register multiple 3D sets without error-prone pre-procedures
and without any mechanical positioning device or manual assistance. The
experimental results demonstrate that the proposed method achieves more
precise 3D registration of a pair of 3D data sets than previous methods.



1 Introduction 

Range imagery is increasingly being used to model real objects and environments, 
and the laser sensors have simplified and automated the process of accurately 
measuring three-dimensional (3D) structure of a static environment [1]. Since it is not 
possible to scan an entire volumetric object at once due to topological and geometrical 
limitations, several range images showing only partial views of the object must be 
registered. Therefore, registration to align multiple 3D data sets in a common 
coordinate system is one of the most important problems in 3D data processing. For 
the registration process, each input data set consists of 3D points in the camera’s local 
coordinate system. In order to register all input sets, a local coordinate of each 3D 
data set is transformed to a common coordinate, and the transformation between two 
data sets can be represented with a homography matrix. More specifically, the process 
provides a pose estimate of the input views that is a rigid body transformation with 
the six rotation and translation parameters. In general, the relative sensor positions of
several range sets can be estimated by mounting the sensor on a robot arm or by keeping
the sensor fixed and moving the object on a turntable [2].
The registration problem is composed of two phases: coarse registration and fine
registration. In general, the coarse process obtains a rough estimate of the 3D
transforms by using mechanical positioning devices and manual processing. In order
to refine the 3D estimate and make a complete 3D model, fine registration is
applied after coarse registration. The iterative closest point (ICP) algorithm is the most
widely used refinement method; it calculates the 3D rigid transformation of the
closest points on the overlapping regions [3]. The two main difficulties in ICP,
determining the extent of overlap in two scans and extending the method to multiple
scans, have been a focus of further research [4]. Namely, ICP requires a priori
knowledge of an approximate estimate of the transformations, so it starts with
pre-estimated overlaps. Otherwise, ICP tends to converge monotonically to the nearest
local minimum of a mean square distance metric.

This paper presents an improved ICP algorithm that can automatically register 
multiple 3D data sets from unknown viewpoints. For a full automatic registration 
without an initial estimation process, we use the sensor projection matrix that is the 
mapping of the 3D data into its associated range image. The sensor projection matrix 
consists of the extrinsic and the intrinsic parameters as the camera projection does. 
The extrinsic parameters describe the position and the orientation of the sensor, and 
the intrinsic parameters contain measurements such as focal length, principal point, 
pixel aspect ratio and skew [5]. Since all range data is obtained with one range sensor, 
in general, the intrinsic parameters always remain unchanged. Then we use the 
covariance matrix to roughly obtain the extrinsic parameters that represent the relative 
3D transformations between two inputs. The improved ICP method iteratively finds 
the closest point on a geometric entity to a given point on the overlapping regions 
based on the sensor projections. By combining ICP algorithm with the sensor 
projection constraint, we can solve the local minimum problem. 

The remainder of the paper is organized as follows. In Sec. 2, previous studies are 
explained. In Sec. 3, an improved ICP algorithm is presented, and in Sec. 4, we 
demonstrate the experimental results and compare with previous methods. Finally, the 
conclusion is described in Sec. 5. 



2 Previous Studies 

In the last few years, several algorithms for 3D registration have been proposed; they
can be classified into semiautomatic and automatic methods. Semiautomatic
approaches require manual assistance, including specification of initial pose estimates,
or rely on external pose measurement systems, so they have several limitations
and require equipment to be set up [6]. For example, a mechanical positioning
device can only deal with indoor-sized objects, and manual assistance may be
inaccurate. In contrast, automatic registration recovers the
viewpoints from which the views were originally obtained without prior knowledge
of the 3D transformation. The main constraint of most automatic methods is that many
pre-processes, including feature extraction, matching and surface segmentation, are
required. In order to calculate the pose for arbitrary rotation and translation
parameters, we need to know at least 3 corresponding feature points between the 3D
data sets. Once correspondences have been established, numerical minimization is
used to determine the object’s rotation and translation [7]. However, automatically
detecting suitable features and matching them is very difficult, and currently no
reliable methods exist. Another approach is to ask the user to supply the
features, but this is very labor intensive and often not very accurate.

From the viewpoint of constructing a complete 3D model, the registration
problem is divided into coarse registration and fine registration. In the coarse process we
usually use mechanical positioning devices or manual processing to obtain a rough
estimate of the 3D transforms. B. Horn proposed a closed-form solution to find the
relationship between two coordinate systems of 3D points by using a unit quaternion
from the covariance matrix [8]. In addition, a refinement technique is needed to improve the
3D estimate and make a complete 3D model. Subsequently, P. J. Besl et al. presented
ICP, which optimizes the 3D parameters based on Horn’s method by using closest-point
matching between two sets [3].



3 Proposed Algorithm 

This paper presents an improved ICP algorithm for automatic registration of multiple
3D data sets without prior information about the 3D transformations. The proposed
iterative method uses the eigenvectors of the covariance matrix and the sensor
projection in an initial estimation. The eigenvectors represent the directions of an
object and define new axes at the centroid of the object. The analysis of the
eigenvectors provides the relative sensor position in a common coordinate system. By
using a cross projection based on the sensor position, the overlapping regions can be
detected. Finally, the improved ICP method iteratively finds the closest point on a
geometric entity to a given point on the overlapping regions, and refines the sensor
position.



3.1 Finding Overlapping Regions by Cross Projection 

If the initial pose of the object differs too much from the real one, it is generally
difficult to construct a precise model due to self-occlusions. The more overlapping
regions the 3D data sets have, the more precise the registration can be.
Therefore, we assume that multiple-view range images have significant overlaps with
each other [9]. By using an initial estimate of the relative sensor position and the
sensor projection constraint, the proposed method finds the overlapping regions
between two range data sets. The overlaps, which can be detected by two sensors at a
time, are located in both 3D data sets. Fig. 1 shows overlapping regions on the first
range data set ($R_1$) at the sensor $S_1$ and the second ($R_2$) at $S_2$.

The overlapping regions on $R_1$ are measured at the second sensor pose ($S_2$), and
those on $R_2$ are also measured at the first ($S_1$). So we can define the overlapping
regions according to the visibility from two viewpoints. In order to find the
overlapping regions, we propose a cross projection method that projects the first data
set $R_1$ into the second sensor position $S_2$, and $R_2$ into $S_1$, respectively. Our method
examines whether there are occlusions (scan errors) and self-occlusions. By analyzing
the relation between the sensor direction vector and the vertex normal vector, we can find
the overlapping regions. For example, when the angle between the sensor direction
and the vertex normal vector ($V_n$) is less than 90 degrees, it is possible to scan the
point by the sensor. Otherwise, we determine that the vertices are occluded, and those
are not detected by the sensor.



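$$(S_2 - V_1) \cdot V_{n1} > 0: \text{ overlap region in } R_1 \qquad (1)$$

$$(S_1 - V_2) \cdot V_{n2} > 0: \text{ overlap region in } R_2, \qquad (2)$$

where $V_1$ and $V_2$ are vertices in $R_1$ and $R_2$, with vertex normals $V_{n1}$ and $V_{n2}$, respectively. In short, the overlap in $R_1$ is
found by $S_2$ and that in $R_2$ by $S_1$.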

In the case of a concave object, there may be self-occlusion regions in the 3D data
sets. As described in Fig. 2, the self-occluded vertices are projected to the same pixel
of the second range image by the second sensor projection. In this case, we examine
the distance between the sensor position and the vertices, and select the vertex closest to
the sensor. The proposed cross projection can exclude the occluded vertices and the
self-occlusions, and find the overlapping regions between two views.
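
The visibility test of Eqs. (1) and (2) and the closest-vertex rule for self-occlusion can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `visible_from` and `nearest_per_pixel` are hypothetical, and the pixel coordinates and depths are assumed to have been obtained already by projecting the vertices through the other sensor.

```python
import numpy as np

def visible_from(sensor_pos, vertices, normals):
    """True where (S - V) . n > 0, i.e. the vertex faces the given sensor."""
    to_sensor = sensor_pos[None, :] - vertices          # (N, 3) vectors V -> S
    return np.einsum("ij,ij->i", to_sensor, normals) > 0.0

def nearest_per_pixel(pixels, depths):
    """Among vertices that project to the same pixel, keep only the closest one."""
    best = {}                                           # pixel -> index of closest vertex
    for i, (p, d) in enumerate(zip(map(tuple, pixels), depths)):
        if p not in best or d < depths[best[p]]:
            best[p] = i
    keep = np.zeros(len(depths), dtype=bool)
    keep[list(best.values())] = True
    return keep
```

Under these assumptions, the overlap in the first data set would be the vertices for which both masks are true when evaluated with the second sensor, and vice versa.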




Fig. 1. Overlaps between two range sensors 



3.2 Sensor Projection Matrix and Sensor Position 

The sensor projection matrix is very similar to the camera projection matrix. In
general, the range sensor provides 3D range data sets ($X$) and the range image ($x$)
corresponding to the range data sets. We can easily compute the sensor projection ($P$)
and the sensor pose ($S$) using $n$ corresponding points between the range image and the 3D
range data sets [5]. The process is summarized as follows:

a. For each correspondence ($x_i$ and $X_i$), a matrix $A_i$ ($2 \times 12$) is computed.

b. Assemble the $n$ matrices $A_i$ into a single matrix $A$ ($2n \times 12$).

c. Obtain the SVD (Singular Value Decomposition) of $A$. A unit singular vector
corresponding to the smallest singular value is the solution $p$. Specifically, if
$A = UDV^T$ with $D$ diagonal with positive diagonal entries, arranged in descending
order down the diagonal, then $p$ is the last column of $V$.

d. The matrix $P$ is determined from $p$, and the sensor pose ($S$) is computed as
follows:

$$S = -M^{-1} p_4,$$

where $p_4$ is the last column of $P$, and $P = M[\,I \mid M^{-1} p_4\,] = KR[\,I \mid -S\,]$.



(3)
Fig. 2. The overlaps by cross projection: (a) 1st range data from 1st sensor, (b) 1st range data
from 2nd sensor: the overlaps, occlusion and self-occlusion through the 2nd sensor and 1st
range data
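
Steps a to d of the sensor projection computation can be sketched as follows. This is a minimal sketch under stated assumptions: the function names are hypothetical, the correspondences are noise-free and non-coplanar, and no coordinate normalization is applied (a practical implementation would usually normalize the points first).

```python
import numpy as np

def sensor_projection_dlt(x, X):
    """Estimate the 3x4 projection P (up to scale) from n >= 6 pairs.

    x: (n, 2) range-image points, X: (n, 3) range data points.
    """
    rows = []
    zero = np.zeros(4)
    for (u, v), Xw in zip(x, X):
        Xh = np.append(Xw, 1.0)                       # homogeneous 3D point
        rows.append(np.concatenate([zero, -Xh, v * Xh]))
        rows.append(np.concatenate([Xh, zero, -u * Xh]))
    A = np.asarray(rows)                              # (2n, 12) stacked A_i
    _, _, Vt = np.linalg.svd(A)
    return Vt[-1].reshape(3, 4)                       # smallest singular vector

def sensor_position(P):
    """Sensor position S = -M^{-1} p4, with P = [M | p4]."""
    M, p4 = P[:, :3], P[:, 3]
    return -np.linalg.solve(M, p4)
```

The recovered position does not depend on the unknown scale or sign of `P`, which is why the singular vector can be used directly.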



Since each range data set is defined in its local coordinate system, all initial sensor
positions obtained from multiple data sets are almost the same. For registration, we
should find an accurate sensor pose in a common coordinate system. An accurate
initial estimate can reduce the computational load and make the result of registration
more reliable. Horn suggested a method for registering two different coordinate systems
by using a unit quaternion obtained in closed form from the covariance of given
corresponding point pairs in the two data sets [8]. Although our method is
mathematically similar to Horn’s, we do not use corresponding point pairs, which
are usually obtained in pre-processing or from geometric limitations of the 3D views. The
proposed algorithm uses the similarity of 3D range data sets of an object from different
views instead of feature correspondences. More specifically, we use the orthonormal
(rotation) matrix that can be easily computed by eigenvalue decomposition of the
covariance matrix instead of a unit quaternion. The eigenvectors of the covariance
matrix provide the major and minor axes of the 3D point cloud, so they define a
new coordinate system for the object. The three eigenvectors of the covariance matrix represent the x,
y, and z axes of the 3D data set, respectively. The obtained covariance matrix is used
to define the object’s local coordinate system, and the centroid of the data set is the origin of
the new coordinate system.

In the first stage, the centroid ($C$) of each range data set is calculated as follows:

$$C = \frac{1}{N} \sum_{j=0}^{N-1} V_j, \qquad (4)$$

where $V_j$ and $N$ represent a 3D vertex in the range data set and the number of vertices,
respectively. In addition, we compute the covariance matrix ($Cov$) of each range data
set as follows:

$$Cov = \frac{1}{N} \sum_{j=0}^{N-1} (V_j - C)(V_j - C)^T. \qquad (5)$$

Let $Cov_1$ and $Cov_2$ be the covariance matrices of two range data sets ($R_1$ and $R_2$),
respectively. We find two object coordinate systems by using eigenvalue decomposition of
both covariance matrices from $R_1$ and $R_2$:

$$Cov_1 = U_1 D_1 U_1^T, \qquad Cov_2 = U_2 D_2 U_2^T, \qquad (6)$$

where the diagonal matrices ($D$) and orthonormal matrices ($U$) represent the eigenvalues
and eigenvectors of the covariance matrices; $U$ then provides a new coordinate system for the object.
$R_1$ and $R_2$ are defined again in the new coordinate system by $U_1$, $U_2$, and the centroids of the two
range sets ($C_1$, $C_2$). In addition, a rigid transformation (Eq. 7) is found from $U_1$ and $U_2$, and
an initial relative sensor position is approximated by this rigid transformation.

$$\begin{bmatrix} U_1 U_2^{-1} & C_2 - C_1 \\ 0_3^T & 1 \end{bmatrix} \qquad (7)$$
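
The initial estimate from centroids and covariance eigenvectors can be sketched as follows. This is a minimal sketch, not the authors' code. Two points are assumptions introduced here: eigenvectors are only defined up to sign, so the sketch orients each axis by the sign of the third moment of the points along it, and the translation is written so that a point of the second set maps into the first as `x1 = R @ x2 + t`, which may differ from the convention of Eq. (7). The function names are hypothetical.

```python
import numpy as np

def object_frame(V):
    """Centroid and right-handed eigenvector frame of an (N, 3) vertex array."""
    C = V.mean(axis=0)
    D = V - C
    cov = D.T @ D / len(V)                  # covariance matrix about the centroid
    _, U = np.linalg.eigh(cov)              # columns: eigenvectors, ascending eigenvalues
    for k in range(2):
        # assumption: fix the sign ambiguity with the third moment along the axis
        if np.sum((D @ U[:, k]) ** 3) < 0:
            U[:, k] = -U[:, k]
    U[:, 2] = np.cross(U[:, 0], U[:, 1])    # enforce a right-handed frame
    return C, U

def initial_transform(V1, V2):
    """4x4 rigid transform taking the frame of V2 onto the frame of V1."""
    C1, U1 = object_frame(V1)
    C2, U2 = object_frame(V2)
    T = np.eye(4)
    T[:3, :3] = U1 @ U2.T                   # U2 is orthonormal, so U2^{-1} = U2^T
    T[:3, 3] = C1 - T[:3, :3] @ C2
    return T
```

The sketch recovers the pose exactly only when both sets sample the same surface and the three eigenvalues are distinct; with partial overlaps it yields the rough estimate that the iterative refinement starts from.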



3.3 3D Registration by Improved ICP Algorithm 

In general, the ICP algorithm finds a rigid transformation that minimizes the least-squares
distance between the point pairs on the pre-determined overlapping regions. By using
the sensor constraints and the cross projection, we can define the overlaps from the
transformed data sets and the sensor positions, and then calculate the 3D rigid
transformation of the closest points on the overlaps. More specifically, two range data
sets are cross-projected into an initial position of the sensor, so that an overlapping region
is found. On the overlaps we find the closest point pairs, and calculate the
transformations that minimize the squared distance metric between the points. The
obtained transformations are used to optimize the initial sensor position for a more
precise location. Our iterative method repeats the estimation of the sensor position
and the detection of the overlapping regions. This process is repeated until the
distance error of the closest point pairs is minimized (Eq. 8), and we can optimize
the sensor pose and 3D transformations of the range data.

$$E = \sum_{i=1}^{n} \left\| V_{1i} - R\,(V_{2i} - C_2) - T \right\|^2, \qquad (8)$$

where $V_{1i}$ and $V_{2i}$ are the $n$ closest point pairs of the overlaps in the two range sets and $C_2$ is the
centroid of $V_2$. The rotation parameters ($R$) are found by eigenvalue decomposition of two
covariance matrices, and the translation ($T$) is the displacement between the centroids of
the points $V_1$ and $V_2$. Our method automatically finds the closest point pairs ($V_1$ and
$V_2$) on the overlapping regions between two 3D sets from unknown viewpoints. Fig. 3
provides the block diagram of the proposed algorithm.
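
The refinement loop can be sketched as a plain point-to-point ICP. This is a simplified illustration under stated assumptions, not the proposed method itself: the overlap detection by cross projection is omitted, closest points are found by brute force, and the per-iteration rigid transform is solved with the SVD-based (Kabsch) closed form rather than the eigenvalue decomposition described above. The function names are hypothetical.

```python
import numpy as np

def best_rigid(A, B):
    """Least-squares R, t such that B ~ A @ R.T + t (SVD closed form)."""
    ca, cb = A.mean(axis=0), B.mean(axis=0)
    H = (A - ca).T @ (B - cb)
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))      # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    return R, cb - R @ ca

def icp(src, dst, iters=30, tol=1e-12):
    """Align src to dst; returns accumulated R, t and the mean squared error."""
    R_total, t_total = np.eye(3), np.zeros(3)
    cur, prev = src.copy(), np.inf
    for _ in range(iters):
        d2 = ((cur[:, None, :] - dst[None, :, :]) ** 2).sum(axis=2)
        nn = d2.argmin(axis=1)                  # closest point in dst for each point
        err = d2[np.arange(len(cur)), nn].mean()
        if abs(prev - err) < tol:               # error no longer decreases
            break
        prev = err
        R, t = best_rigid(cur, dst[nn])
        cur = cur @ R.T + t
        R_total, t_total = R @ R_total, R @ t_total + t
    return R_total, t_total, err
```

Restricting `src` and `dst` to the detected overlapping regions before each closest-point search would bring the sketch closer to the procedure described in this section.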



4 Experimental Results 

We have demonstrated that the proposed algorithm can register 3D data sets from
unknown viewpoints more precisely and efficiently than the ICP method. The proposed
algorithm has been tested on various 3D range images, which contain from 60K to
100K data points. The simulation is performed on a PC with an Intel Pentium 4 1.6GHz. In
order to compare the performance of each method, we use a virtual range scanning
system and a real laser scanner with a calibrated turntable. We fixed the virtual
sensor and acquired the “COW” range data sets by rotating the object by +60
degrees about the Y-axis. The “FACE” data is acquired with a Minolta Vivid 700 laser
scanner and the calibrated turntable with the same rotation. Consequently, we can
precisely compare the performance of the previous method with that of the proposed method
because the 3D transformations are known.




Fig. 3. Block diagram of the proposed method 







Fig. 4. Registration results of “COW” data sets, (a) Input range data sets (b) initial pose (c) 
registration by ICP (d) registration by improved ICP 



Fig. 4 and 5 show the experimental results on two range data sets. As shown in
Table 1, our method can achieve a precise registration without a precision milling
machine or a prior 3D transformation between views. The ICP algorithm computes the 3D
parameters of the closest points on the surface regions based on the sensor projection.
In contrast, the proposed method first finds the overlapping regions, and only the
points on these regions are considered. Therefore, the computation time for iteration
Fig. 5. Registration results of “FACE” data sets, (a) Real object (b) range data sets (c) initial
pose (d) (e) registration by improved ICP (f) distance errors of the closest point pairs



Table 1. The experimental results on two range data sets

Data  Item                         Previous ICP                         Proposed Method
COW   Iterations (times)           93                                   15
      CPU time (sec)               1867                                 87
      Rotation parameter           θx: -2.392, θy: 58.95, θz: 0.709     θx: -0.052, θy: 59.91, θz: 0.067
      Translation parameter        Tx: -0.352, Ty: 0.036, Tz: 0.047     Tx: 0.109, Ty: -0.12, Tz: 0.945
      Registration error (rot.)    θx: 2.392, θy: 1.051, θz: -0.709     θx: 0.052, θy: 0.09, θz: 0.067
      Registration error (trans.)  Tx: 0.352, Ty: -0.036, Tz: -0.047    Tx: -0.109, Ty: 0.12, Tz: -0.945
FACE  Iterations (times)           92                                   87
      CPU time (sec)               694                                  123
      Rotation parameter           θx: 5.624, θy: 15.77, θz: 7.469      θx: 0.125, θy: 60.07, θz: -0.100
      Translation parameter        Tx: -4.31, Ty: 1.173, Tz: 1.895      Tx: 0.224, Ty: -0.03, Tz: -0.142
      Registration error (rot.)    θx: -5.624, θy: 44.23, θz: -7.469    θx: -0.125, θy: 0.07, θz: 0.100
      Registration error (trans.)  Tx: 4.31, Ty: -1.173, Tz: -1.895     Tx: -0.224, Ty: 0.03, Tz: 0.142



in ICP is much longer than that in the proposed method. ICP converged to a local
minimum of the mean square distance metric on the “FACE” data, so the results on
“FACE” have much larger errors than those on “COW”. With the
proposed method, the number of iterations for “COW” is much smaller than that for “FACE”, because
the major and minor axes of “COW” are more distinct than those of “FACE”. The results
of the improved ICP show very small errors in each axis, as shown in Table 1, and
these can be reduced further through an integration process. The proposed method
obtains the major and minor axes of the 3D data sets to analyze the relative
transformation, so it is difficult to register a totally spherical object.

A feature point extraction algorithm and ICP are generally combined to achieve
automatic registration of 3D data sets from unknown viewpoints. In this paper, the
spin image is used for feature point extraction in registration [10]. Fig. 6 shows the
comparison of the registration results by three methods. Because the overlaps between
the two data sets in “WATCH” are too small, it is difficult to find the corresponding
points. As shown in Fig. 6 (a), only the proposed method accurately registered the two
data sets. The shape of “COPTER” has bilateral symmetry, so it is hard for the spin image
to establish the correspondence between two views. On the other hand, the proposed
method computes the major and minor axes of the object, and can cope with its
symmetry in Fig. 6 (b). Since “DRIVER” has few overlapping regions and its hilt is
cylindrically symmetric, the spin image cannot match the points between
views. The shapes of the hilt in “DRIVER” from different viewpoints are almost alike, so
it is difficult for our method to determine the minor axis of the hilt uniquely. Fig. 6 (c)
shows that both methods failed to achieve a precise 3D registration.

Fig. 6. Comparison of previous methods and the proposed method



5 Conclusions 

This paper presents an improved ICP algorithm that can automatically register
multiple 3D data sets from unknown viewpoints without preliminary processes
such as feature extraction and matching. The proposed method uses the sensor
projection and the covariance matrix to estimate an initial position of the sensor. By
using the cross projection based on the obtained sensor position, we can find the
overlapping regions. Finally, the improved ICP algorithm iteratively finds the closest
point on a geometric entity to the given point on the overlapping regions, and refines
the sensor position. The experimental results demonstrated that the proposed method
can achieve a more precise 3D registration than previous methods. Further research
includes a study on application to cylindrical or spherical objects. In addition, we will
create photorealistic scenes through 2D/3D alignment by using projections such as
texture mapping.
Abstract. Camera pose and scene geometry estimation is a fundamental
requirement for match move, which inserts synthetic 3D objects into real scenes. In
order to automate this process, auto-calibration that estimates the camera
motion without prior calibration information is needed. Most auto-calibration
methods for multiple views contain bundle adjustment or a non-linear minimization
process, which is a complex and difficult problem. This paper presents two methods
for recovering structure and motion from handheld image sequences: one is
key-frame selection, and the other is to reject the frames with large errors among the
key-frames in absolute quadric estimation by LMedS (Least Median of Squares).
The experimental results showed that the proposed method can achieve precise
camera pose and scene geometry estimation without bundle adjustment.



1 Introduction 

Computer vision techniques have been applied to visual effects since the 1990s, and
image composition that combines an actual shot image with a virtual object is a
representative research area. Applications involving synthesis of real scenes and
synthetic objects require image analysis tools that help in automating the synthesis
process. One such application area is match move, in which the goal is to insert
synthetic 3D objects in real but un-modeled scenes and create their views from the
given camera positions so that they appear to move as if they were a part of the real
scene [1]. For a stable change in the 3D appearance of the object from the given camera
position, 3D camera pose estimation is needed. At the same time, an accurate 3D
structure of the scene is used for placement of the objects with respect to the real
scene. In order to automate this process, reliable camera pose estimation and 3D scene
geometry recovery without prior calibration knowledge are necessary. This paper is a
study on automated end-to-end multi-view pose and geometry estimation that works
with auto-calibration.

Multi-view pose and geometry analysis has attracted much attention in the past few
years [2,3,4]. H. S. Sawhney presents a method to estimate an accurate relative
camera pose from the fundamental matrix over extended video sequences [1]. M.
Pollefeys proposes a 3D modeling technique over image sequences from a handheld
camera, and then extends it to an AR system [4,5]. S. Gibson describes an improved
feature tracking algorithm, based on the widely used KLT tracker, and presents a
robust hierarchical scheme for merging sub-sequences together to form a complete
projective reconstruction, then finally describes how RANSAC-based random
sampling can be applied to the problem of self-calibration [6]. After the projective
reconstruction process, however, most algorithms require bundle adjustment or
nonlinear minimization, which is a very complex and difficult problem.

This paper presents two sampling methods to estimate camera positions and scene
structure from video sequences. The first step is key-frame selection. Reconstruction
of 3D structure is achieved by first selecting a set of key-frames with which to build
small, sub-sequence reconstructions. Key-frame selection has several benefits. The
most important benefit is that 3D camera pose estimation and 3D scene geometry
recovery, which are relatively expensive processes, can be performed
with a smaller number of views. Another benefit is that video sequences with different
amounts of motion per frame become more isotropic after frame decimation [7]. The
second step is to reject the frames with large errors among the key-frames in the absolute
quadric estimation by LMedS (Least Median of Squares). The LMedS algorithm chooses,
among all the tested hypotheses, the one with the least median squared residual on the
entire absolute quadric sets [8]. In order to upgrade the projective structure to a metric
reconstruction in auto-calibration, an unknown projective transformation can be
acquired by decomposition of the absolute quadric [3]. Absolute quadric estimation can
achieve a precise upgrade to metric reconstruction. This paper presents a novel
approach to auto-calibration that uses an LMedS-based random sampling algorithm to
estimate the absolute quadric. A schematic diagram (Fig. 1) shows the workflow based on
the two steps for 3D match move.

To begin with, Sec. 2 describes auto-calibration using the absolute quadric, and Sec. 3
discusses key-frame selection. After details of our LMedS-based absolute quadric
estimation are given in Sec. 4, we show the experimental results for synthetic and real
scenes in Sec. 5. Finally, the conclusion is described in Sec. 6.



[Figure: Un-calibrated Image Sequences -> 1st Sampling: Key-frame Selection -> 2nd Sampling: LMedS based Absolute Quadric Estimation]







Fig. 1. Two steps for camera pose and 3D scene structure estimation 



2 Auto-calibration 

The resulting projective structure and motion are given up to an unknown projective
transformation. To upgrade the projective structure to a Euclidean reconstruction,
traditional methods first calibrate the camera by using an object with a calibration
pattern. Then, a metric structure of the scene can be acquired from the correspondences
between images. A well-designed calibration object with known 3D Euclidean
geometry is necessary for precise camera calibration. Recently, auto-calibration
algorithms have been actively researched to avoid placing a calibration object in
the scene, because pre-procedures for calibration have several limitations and
require equipment to be set up. Because auto-calibration algorithms can
estimate the camera parameters without prior information, many methods have been
widely studied up to the present. Among them, Heyden and Åström, Triggs, and Pollefeys
used explicit constraints that relate the absolute conic to its images [3,9].



2.1 Projective Reconstruction 

Projective structure and motion can be computed without camera parameters when
point matches are given from more than two perspective images [9]. For two images,
the bilinear constraint expressed by the fundamental matrix is called the epipolar
constraint. For three and four images, the trilinear and quadrilinear constraints are
expressed by the trifocal and quadrifocal tensors, respectively. For more than
four images, P. Sturm presented factorization methods that suffer less from drift and
error accumulation by calculating all camera projection matrices and structure at the
same time [10]. The drawback of factorization methods, which rely on matrix decomposition,
is that all corresponding points must remain in all views from the first frame to
the last. To overcome this problem, merging-based projective reconstruction
methods have been proposed [11,12]. Sequential merging algorithms are heavily dependent on
a good initial estimate of structure, and are susceptible to drift over long sequences.
Therefore, the resultant error increases cumulatively over time. Hierarchical
merging methods have been proposed to reduce the error. Hierarchical merging has the
advantage that the error can be more evenly distributed over an entire sequence [6]. In
the experimental results, we evaluate the accuracy of the proposed auto-calibration
algorithm with both the sequential and the hierarchical approaches.



2.2 Auto-calibration Algorithm 

The process of projecting a 3D point onto the image plane can be represented as
follows:
where T represents the transformation of coordinate systems from the world to the
camera, and the remaining factors are the perspective projection and the intrinsic
parameter matrix K. We can
reconstruct a scene up to a projective transformation by using the corresponding
points on the images.

$$m = P_{proj} M_{proj} = P_{proj} H H^{-1} M_{proj}, \qquad (2)$$

where $m$ denotes the point in the image, and $P_{proj}$ and $M_{proj}$ are the projective projection
matrix and the projective structure of the scene point corresponding to the image point
$m$, respectively. The projective structure is related to the transformation $H$.
Calibration is the process of finding the transformation $H$, which can be obtained by
decomposition of the absolute quadric. The absolute quadric is estimated as follows:
(3)



where $Q$ is the absolute quadric in the projective coordinate frame.

We assume zero skew and unit aspect ratio, and that the principal point is known.
Then the linear equations on $Q$ are generated from the zero entries in Eq. (3). This is
represented as:
(4)



where $(\cdot)_{ij}$ is the element in the i-th row and j-th column. The absolute quadric can be
estimated from at least three images by Eq. (4). When $Q$ is known, we can easily
obtain the intrinsic parameters by using Cholesky decomposition of $\omega^*$. The absolute
quadric can be decomposed by EVD (eigenvalue decomposition) as:

$$Q = U D U^T = H \tilde{Q} H^T, \qquad (5)$$

where $\tilde{Q}$ is the absolute quadric in the Euclidean coordinate frame. The zero eigenvalue of $D$ is
replaced by 1. Finally, from Eq. (5), the Euclidean camera motion and structure are
obtained by applying $H$ to the projective coordinate frame [5,6]. Fig. 2 shows the relation
between the projective and Euclidean coordinates.
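
The linear estimation of the absolute quadric and its decomposition can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: image coordinates are assumed to be shifted so that the known principal point is at the origin, which together with zero skew and unit aspect ratio gives the four linear constraints per view used below (three vanishing off-diagonal entries and equal first two diagonal entries of $P Q P^T$). The function names are hypothetical.

```python
import numpy as np

PAIRS = [(a, b) for a in range(4) for b in range(a, 4)]   # 10 entries of symmetric Q

def entry_coeffs(P, i, j):
    """Coefficients of the 10 unknowns of Q in the entry (P Q P^T)[i, j]."""
    return np.array([P[i, a] * P[j, b] + (P[i, b] * P[j, a] if a != b else 0.0)
                     for a, b in PAIRS])

def estimate_absolute_quadric(cameras):
    """Linear estimate of Q (up to scale) from at least three 3x4 cameras."""
    rows = []
    for P in cameras:
        rows += [entry_coeffs(P, 0, 1), entry_coeffs(P, 0, 2), entry_coeffs(P, 1, 2),
                 entry_coeffs(P, 0, 0) - entry_coeffs(P, 1, 1)]
    q = np.linalg.svd(np.asarray(rows))[2][-1]            # null vector of the system
    Q = np.zeros((4, 4))
    for k, (a, b) in enumerate(PAIRS):
        Q[a, b] = Q[b, a] = q[k]
    return Q if np.trace(Q) > 0 else -Q                   # make Q positive semi-definite

def rectifying_transform(Q):
    """H such that Q ~ H diag(1, 1, 1, 0) H^T, from the EVD of Q."""
    w, U = np.linalg.eigh(Q)
    w, U = w[::-1], U[:, ::-1]                            # descending; ~0 eigenvalue last
    scale = np.sqrt(np.abs(w))
    scale[3] = 1.0                                        # zero eigenvalue replaced by 1
    return U * scale
```

Multiplying each projective camera by the returned `H` gives cameras whose left 3x3 block is, up to scale, an intrinsic matrix times a rotation.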



[Figure: projective space and Euclidean space related by a transformation; projective cameras and Euclidean cameras with image planes 1 and 2]



Fig. 2. Relation of projective and Euclidean coordinate 



3 Key-Frame Selection 

In general, the motion between frames has to be fairly small so that a precise
correspondence can be established by using automatic matching, while significant
parallax and a large baseline are desirable to get a well-conditioned problem. A good
choice of frames from a video sequence can produce a more appropriate input to
pose and geometry recovery and thereby improve the final result [7]. So the goal of
key-frame selection is to select a minimal subsequence of feature views from the video,
such that correspondence matching still works for all pairs of adjacent frames in the
subsequence.

To achieve this goal, we consider three measures: (i) the ratio of the number of
corresponding points to that of feature points, (ii) the distribution of corresponding points
over the frame, and (iii) the homography error. Eq. (6) is a combination of the three
measures:

$$S = w_1 \left(1 - \frac{N_c}{N_f}\right) + w_2\,\sigma + w_3\,\varepsilon_H, \qquad (6)$$

where $S$ is the score used to select the key-frame, $N_c$ and $N_f$ are the number of
corresponding points and that of feature points, $\sigma$ is the standard deviation of the point
density, $\varepsilon_H$ is the homography error measure, and $w_i$ is the weight used to alter the relative
significance of each score. Typically, the homography error is small when there is
little camera motion between frames. The homography error is used to evaluate the
baseline length between two views. The standard deviation of the point density represents
the distribution of corresponding points. If the corresponding points are evenly
distributed over the image, we can obtain a more precise fundamental matrix [13].
Because the fundamental matrix contains all available information about the camera
motion, an evenly distributed set of corresponding points improves the final estimation results.
To evaluate whether points are distributed evenly over the image, we divide the entire
image uniformly into sub-regions based on the number of corresponding points, and
then calculate the point density of each sub-region and that of the image. The standard
deviation can be represented as:



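$$\sigma_p = \sqrt{\frac{1}{N_S} \sum_{i=1}^{N_S} \left( \frac{N_{C_i}}{A_i} - \frac{N_C}{A} \right)^2} \qquad (7)$$

where $N_S$ is the number of sub-regions, $N_C$ and $N_{C_i}$ are the number of inliers in the
image and that in the i-th sub-region, respectively, and $A$ and $A_i$ are the corresponding areas.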

The selection process starts by positioning the first key-frame at the first frame. All
possible pairings of the first frame with the consecutive frames in the sequence are
then considered. Assuming that a key-frame has already been placed at frame i,
key-frame selection is achieved by evaluating the score for each pairing of frame i
with a subsequent frame. This is continued until the ratio of the number of
corresponding points to that of feature points drops below 50%. The frame with the
lowest score is then marked as the next key-frame.
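
The selection rule can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the callback `pair_stats` is hypothetical and is assumed to return the correspondence ratio and the score S for a pairing of two frames, and the rule "advance by one frame when no candidate survives" is an added assumption to guarantee progress.

```python
from typing import Callable, List, Tuple

def select_key_frames(
    n_frames: int,
    pair_stats: Callable[[int, int], Tuple[float, float]],
    min_ratio: float = 0.5,
) -> List[int]:
    """Greedy key-frame selection.

    pair_stats(i, j) is a hypothetical callback returning (ratio, score) for
    the pairing of key-frame i with a later frame j: ratio is the number of
    corresponding points divided by the number of feature points, and score
    is the combined measure S.
    """
    key_frames = [0]                       # the first frame is the first key-frame
    while key_frames[-1] < n_frames - 1:
        i = key_frames[-1]
        best_j, best_score = None, float("inf")
        for j in range(i + 1, n_frames):
            ratio, score = pair_stats(i, j)
            if ratio < min_ratio:          # too few correspondences survive: stop scanning
                break
            if score < best_score:         # keep the candidate with the lowest score
                best_j, best_score = j, score
        if best_j is None:                 # assumption: always advance by at least one frame
            best_j = i + 1
        key_frames.append(best_j)
    return key_frames

def toy_stats(i: int, j: int) -> Tuple[float, float]:
    """Synthetic measures: the ratio decays with the gap, the score is lowest at gap 3."""
    gap = j - i
    return 1.0 - gap / 8, float(abs(gap - 3))

print(select_key_frames(10, toy_stats))    # [0, 3, 6, 9]
```

With the synthetic measures above, the scan from each key-frame stops once the ratio falls below 0.5 (a gap of five frames), and the frame three steps ahead is chosen each time.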



4 LMedS Based Absolute Quadric Estimation 

The absolute quadric can be estimated by Eq. (4) from at least three images. For a more
precise absolute quadric estimation, we present a novel approach using an LMedS-based
random sampling algorithm. Random sets of projection matrices are
selected from the key-frame set, and the linear equations of Eq. (4) are derived. We
automatically reject the key-frames with large errors, which cause the absolute
quadric estimation to fail. The estimated absolute quadric is projected onto each camera
matrix, and the residual of each camera matrix is computed by Eq. (8), in which
the foundation camera matrix is the initial projection matrix [I | 0].
We iterate the sampling and the residual computation an arbitrary number of times,
and then find the absolute quadric with the minimum median residual. From the minimum
median residual, a threshold for rejecting the camera matrices that cause the absolute quadric
estimation to fail can be computed as follows [8]:

$$\tau = 2.5 \times 1.4826 \, [1 + 5/(n - q)] \sqrt{r_{min}} , \qquad (9)$$

where $n$ and $q$ are the number of cameras and that of the selected cameras,
respectively, and $r_{min}$ is the minimum median residual.

The LMedS-based absolute quadric estimation is summarized as follows.

1. Perform the projective reconstruction.

2. Randomly sample two camera matrices, excluding the foundation camera matrix.

3. Estimate the absolute quadric by Eq. (4) and compute the residual of each camera
matrix by Eq. (8).

4. Repeat steps 2-3 an arbitrary number of times, and find the absolute quadric with the
minimum median residual.

5. Reject camera matrices by Eq. (9).

6. Re-estimate the absolute quadric from the inlier camera matrix set.

The camera matrices of the frames rejected in the two sampling processes are recovered by
the camera resection algorithm, which estimates the camera projection matrix from
corresponding 3D points and image entities [9]. The corresponding 3D points can be
easily obtained by the matching process over neighboring frames.
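
The sampling, scoring, rejection and re-estimation steps can be illustrated with a generic LMedS routine. The sketch below is not the authors' implementation: to stay self-contained it substitutes 2-D line fitting for absolute quadric estimation, and the names `lmeds`, `fit_line` and `line_residual` are hypothetical. Only the rejection threshold follows the form of Eq. (9).

```python
import math
import random
import statistics
from typing import Callable, List, Optional, Sequence, Tuple

Point = Tuple[float, float]
Line = Tuple[float, float]

def lmeds(data: Sequence, fit: Callable, residual: Callable,
          sample_size: int, n_trials: int = 200, seed: int = 0):
    """Least-median-of-squares estimation followed by outlier rejection."""
    rng = random.Random(seed)
    items = list(data)
    best_model, best_median = None, float("inf")
    for _ in range(n_trials):
        model = fit(rng.sample(items, sample_size))        # random minimal sample
        if model is None:                                  # degenerate sample
            continue
        med = statistics.median(residual(model, d) ** 2 for d in items)
        if med < best_median:                              # keep the minimum median
            best_model, best_median = model, med
    if best_model is None:
        raise ValueError("no valid sample was found")
    # Robust threshold in the form of Eq. (9): n items, q items per sample.
    n, q = len(items), sample_size
    tau = 2.5 * 1.4826 * (1.0 + 5.0 / (n - q)) * math.sqrt(best_median)
    inliers = [i for i, d in enumerate(items) if abs(residual(best_model, d)) <= tau]
    # Re-estimate from the inlier set only.
    return fit([items[i] for i in inliers]), inliers

def fit_line(points: Sequence[Point]) -> Optional[Line]:
    """Least-squares line y = a*x + b; returns None for a degenerate sample."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points)
    if sxx == 0:
        return None
    a = sum((x - mx) * (y - my) for x, y in points) / sxx
    return a, my - a * mx

def line_residual(model: Line, p: Point) -> float:
    a, b = model
    return p[1] - (a * p[0] + b)

# Ten points on y = 2x + 1, two of them corrupted by a large error.
points = [(x, 2 * x + 1 + (20 if x in (3, 7) else 0)) for x in range(10)]
model, inliers = lmeds(points, fit_line, line_residual, sample_size=2)
print(inliers)
```

Because the uncorrupted points in this toy data lie exactly on a line, the minimum median residual is zero and the threshold collapses to zero; with noisy data the threshold scales with the spread of the inlier residuals.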



5 Experimental Results 

5.1 Key-Frame Selection 

The proposed method has been tested on four video sequences and compared with
Nister’s method (A) and S. Gibson’s method (B) [6,7]. The simulation was performed on a PC
with an Intel Pentium 4 2.3 GHz processor and 1 Gbyte of RAM. Table 1 shows the number of
key-frames and the computation time. In the results of Nister’s method, the computation
time depends strongly on the image size, because it computes the sharpness of the
image over all frames. Gibson’s method estimates the fundamental matrix on every
frame, and selects a relatively smaller number of key-frames than Nister’s.
However, Gibson’s computation time is almost the same as Nister’s, since a heavy
computational load is required to estimate a precise fundamental matrix. In
contrast, the proposed method (C) considers the distribution of corresponding
points on the frame instead of estimating the fundamental matrix directly. Therefore,
the proposed algorithm is faster than the previous methods, and the positions and the
number of the selected key-frames are almost the same as those of Gibson’s method. Because
camera pose and scene geometry are estimated on the key-frames, selecting fewer
and more precise key-frames provides computational efficiency and accuracy.

Table 1. Number of selected key-frames and computation time (A: Nister’s method, B: Gibson’s method, C: proposed method)

Video sequence                          Selected key-frames    Computation time (sec)
Type       Total frames   Frame size    A     B     C          A      B      C
Box        621            720 x 480     23    14    15         412    416    179
Desk       407            720 x 480     17    13    13         201    KiTil  76
Fountain   134            320 x 240     28    12    12         1023   msm    10
Cottage    100            720 x 576     27    27    24         74     102    42




Fig. 3. Synthetic model and camera pose 



5.2 Synthetic Data 

We have experimented on a synthetic object to evaluate the proposed auto-calibration
algorithm. Fig. 3 shows the synthetic model and the camera pose. The
camera is rotated around the model and moved along the positive y-axis at the same time.
The intrinsic parameters are fixed, and noise is added to the synthetic data. We have
estimated the absolute quadric with the linear method, bundle adjustment [9] and the
proposed algorithm under two merging approaches: sequential merging [10] and
hierarchical merging [6]. Figs. 4 and 5 show the accumulated error of the
camera intrinsic parameters obtained from the absolute quadric, and a comparison of the
recovered camera poses. The sequential merging algorithm depends on an initial estimate
of structure, and the error propagates more and more over time. On the other hand,
the hierarchical merging algorithm distributes the error evenly over the entire sequence.
Therefore, the hierarchical merging algorithm performs better than
the sequential merging algorithm. In addition, bundle adjustment performs better
than the linear method, and the proposed method achieves much more precise results
than bundle adjustment with sequential merging, as shown in Figs. 4 and 5.



5.3 Real Image Sequences 

We have tested the proposed method and the previous methods on a real video
sequence. The number of frames is 621 and the image size is 720x480. Three
frames of the video sequence are shown in Fig. 6(a). Fig. 6(b) shows, in graph
form, the 15 frames selected by the proposed method; the graph shows the homography
error in every frame. Fig. 6(c) gives the results of the linear method, bundle adjustment
and the proposed method, respectively, under the two merging algorithms, which are the
sequential and the hierarchical merging. The results are obtained by accumulating
the distance error between the original principal points and the obtained ones. Fig. 6(c)
shows that the proposed method can achieve precise camera pose estimation without bundle
adjustment.

Fig. 4. Intrinsic parameter error graph

Fig. 5. Recovered camera motion and scene geometry

Fig. 6. (a) Real video sequence, (b) the selected key-frames and (c) principal point
accumulation error graph



5.4 Match Move 

The proposed algorithm has been tested on two real image sequences. The first
image sequence has 16 frames and the image size is 800x600. The virtual object is
combined with the real scene by using the recovered camera motion and the scene
geometry. Fig. 7 shows the recovered camera motion and the textured scene geometry
from the first image sequence. The second sequence is a real
video sequence with 407 frames and an image size of 720x480. Fig. 8 shows the augmented video
sequences for the two image sequences.




Fig. 7. Recovered camera pose and textured scene geometry 




Fig. 8. Augmented video sequences 



6 Conclusion 

3D structure and motion recovery is important for video sequence processing.
Previous methods use bundle adjustment or a non-linear minimization process, which is a
complex and difficult problem. This paper proposes a new algorithm for camera
motion estimation and scene geometry recovery based on two processes: key-frame
selection and absolute quadric estimation by LMedS. The experimental results
demonstrate that the proposed method can achieve a more precise estimation of
camera motion and scene structure on a couple of image sequences than previous methods.
Further research on precise scene modeling is needed to resolve virtual-real
occlusion and collision.


Abstract. This work presents a multiscale image approach for contrast
enhancement and segmentation based on a composition of contrast op- 
erators. The contrast operators are built by means of the opening and 
closing by reconstruction. The operator that works on bright regions uses 
the opening and the identity as primitives, while the one working on the 
dark zones uses the closing and the identity as primitives. To select the 
primitives, a contrast criterion given by the connected tophat transfor- 
mation is proposed. This choice enables us to introduce a well-defined 
contrast in the output image. By applying these operators by composi- 
tion according to the scale parameter, the output image not only pre- 
serves a well-defined contrast at each scale, but also increases the contrast 
at finer scales. Because of the use of connected transformations to build 
these operators, the principal edges of the input image are preserved and 
enhanced in the output image. Finally, these operators are improved by 
applying an anamorphosis to the regions verifying the criterion. 



1 Introduction 

Image enhancement is a useful technique in image processing that permits the 
improvement of the visual appearance of the image or provides a transformed im- 
age that enables other image processing tasks (image segmentation, for example). 
Methods in image enhancement are generally classified into spatial methods and 
frequency domain ones. The present work is focused on the spatial methods, and
in particular, on the use of morphological image transformations. The methods
presented here not only have the objective of improving the visualization, but
are also intended as a preprocessing step for image segmentation. In mathematical
morphology (MM), several works have focused on contrast enhancement
([1], [2], [3], [4], [5], [6]). Among them, some interesting works concerning
multiscale contrast enhancement were made by Toet ([1]), and Mukhopadhyay
and Chanda ([2]). Toet proposes an image decomposition scheme based on local
luminance contrast for the fusion of images. In particular, the use of the alternating
sequential morphological filters as a class of low-pass filters was proposed in
[1]. On the other hand, Mukhopadhyay and Chanda [2] propose a decomposition
of the image based on the residues obtained by the tophat transformation. However,
the first formal work in morphological contrast was made by Meyer and
Serra [6] who propose a framework theory for morphological contrast enhance- 
ment based on the activity lattice structure. In their work, the original idea of 
Kramer and Bruckner (KB) transformation [7] was used. This transformation, 
which sharpens the transitions between the object and background, changes the 
gray value of the original image at each point of the image for the closest of the 
dilation and erosion values. The contrast operators proposed by Serra and Meyer 
progress in the way suggested by KB, but the hypotheses are modified. They not 
only assume that the transformations are extensive and anti-extensive, but also 
that the transformations must be idempotent. The use of this last hypothesis to 
build contrast operators, avoids the risk of degrading the image. Another form 
that allows an attenuation of the image degradation problem in the KB algo- 
rithm was proposed by Terol-Villalobos [8]. In his work, the dilation and erosion 
transformations are also used as in the KB transformation, but in a separated 
way to build a class of non-increasing filters called morphological slope filters 
(MSF). On the other hand, the proximity criterion is not used, and a gradient 
criterion is introduced. An extension of this class of filters was proposed in [5]. 
In this case, the MSF are sequentially applied rendering a selection of features at 
each level of the sequence of filters. Recently, in [9], a class of connected MSF was
proposed by working with the flat zone notion. The present work progresses in
the same way suggested in [6], but using the opening and closing by reconstruction
as primitives. However, in a similar manner to the dilation and erosion
in MSF, the opening and closing by reconstruction will be used separately. On
the other hand, the proximity criterion will be avoided and a contrast criterion, 
given by the tophat transformation will be used for selecting the primitives. Fur- 
thermore, the use of a tophat criterion is combined with the notion of multiscale 
processing to originate a powerful class of contrast mappings. The use of filters 
by reconstruction, that form a class of connected filters, will allow the definition 
of a multiscale approach for contrast enhancement. 

2 Some Basic Concepts of Morphological Filtering 

2.1 Basic Notions of Morphological Filtering 

The basic morphological filters ([10]) are the morphological opening $\gamma_{\mu B}$ and the
morphological closing $\varphi_{\mu B}$ with a given structuring element; where, in this work,
$B$ is an elementary structuring element (3x3 pixels) that contains its origin. $\check{B}$
is the transposed set ($\check{B} = \{-x : x \in B\}$) and $\mu$ is a homothetic parameter.
The morphological opening and closing are given, respectively, by:

$$\gamma_{\mu B}(f)(x) = \delta_{\mu \check{B}}(\varepsilon_{\mu B}(f))(x) \quad \text{and} \quad \varphi_{\mu B}(f)(x) = \varepsilon_{\mu \check{B}}(\delta_{\mu B}(f))(x) \qquad (1)$$

where the morphological erosion $\varepsilon_{\mu B}$ and dilation $\delta_{\mu B}$ are expressed by:
$\varepsilon_{\mu B}(f)(x) = \wedge\{f(y) : y \in \mu \check{B}_x\}$ and $\delta_{\mu B}(f)(x) = \vee\{f(y) : y \in \mu \check{B}_x\}$. $\wedge$ is
the inf operator and $\vee$ is the sup operator. In the following, we will omit the
elementary structuring element $B$. The expressions $\gamma_\mu$ and $\gamma_{\mu B}$ are equivalent (i.e.
$\gamma_\mu = \gamma_{\mu B}$). When the homothetic parameter is $\mu = 1$, the structuring element
$B$ will also be omitted (i.e. $\delta_B = \delta$). When $\mu = 0$, the structuring element is a
set made up of one point (the origin).

Another class of filters is composed of the opening and closing by reconstruction.
When filters by reconstruction are built, the basic geodesic transformations,
the geodesic dilation and the geodesic erosion of size 1, are iterated until idempotence
is reached [11]. The geodesic dilation and the geodesic erosion of
size one are given by $\delta^1_f(g) = f \wedge \delta(g)$ with $g \le f$ and $\varepsilon^1_f(g) = f \vee \varepsilon(g)$ with
$g \ge f$, respectively. When the function $g$ is equal to the erosion or the dilation of
the original function, we obtain the opening and the closing by reconstruction:

$$\tilde{\gamma}_\mu(f) = \lim_{n \to \infty} \delta^n_f(\varepsilon_\mu(f)) \qquad \tilde{\varphi}_\mu(f) = \lim_{n \to \infty} \varepsilon^n_f(\delta_\mu(f)) \qquad (2)$$



2.2 Connectivity and Connected Tophat Transformations 

An interesting way of introducing connectivity for functions is via the flat zone
notion and partitions. Concerning the flat zone notion, one says that the flat
zones of a function are the largest connected components of points with the
same gray-level value. On the other hand, a partition of a space $E$ is a set of
connected components $\{x_i\}$ which are disjoint ($x_i \cap x_j = \emptyset$) and whose union
is the entire space ($\cup x_i = E$). Thus, since the set of flat zones of a function
constitutes a partition of the space, a connected operator for functions can be
defined as follows.

Definition 1 An operator $\psi$ acting on gray-level functions is said to be connected
if, for any function $f$, the partition of flat zones of $\psi(f)$ is less fine than
the partition of flat zones of $f$.

The openings and closings by reconstruction (eqn. (2)) are the basic morphological
connected filters. When applying these filters, the flat zones grow
with $\mu$; the flat zones are merged. Based on these transformations, other connected
transformations can be defined. In particular, the connected tophat transformation
is computed by the arithmetic pointwise difference of the original function
from the opened one (or the difference of the closed function from the original
one): $Th_{w\mu}(f)(x) = f(x) - \tilde{\gamma}_\mu(f)(x)$ (and $Th_{b\mu}(f)(x) = \tilde{\varphi}_\mu(f)(x) - f(x)$),
where $\tilde{\gamma}_\mu$ and $\tilde{\varphi}_\mu$ denote the opening and the closing by reconstruction.
Below, the opening (closing) by reconstruction will be used as primitive to build
the contrast operator, while the connected tophat transformation will be used
as criterion to select the primitives.
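
As an illustration, the opening by reconstruction and the connected white tophat can be sketched on a 1-D gray-level signal. This is a minimal sketch under stated assumptions, not the authors' code: the structuring element is taken as a flat segment of half-width mu, the window is simply truncated at the signal borders, and the function names are hypothetical.

```python
from typing import List

def erode(f: List[int], mu: int) -> List[int]:
    """Erosion by a flat 1-D structuring element of half-width mu."""
    return [min(f[max(0, i - mu): i + mu + 1]) for i in range(len(f))]

def dilate(f: List[int], mu: int) -> List[int]:
    """Dilation by a flat 1-D structuring element of half-width mu."""
    return [max(f[max(0, i - mu): i + mu + 1]) for i in range(len(f))]

def opening_by_reconstruction(f: List[int], mu: int) -> List[int]:
    """Iterate the geodesic dilation of size 1 of the erosion, under f, until stability."""
    g = erode(f, mu)
    while True:
        nxt = [min(fx, dx) for fx, dx in zip(f, dilate(g, 1))]
        if nxt == g:
            return g
        g = nxt

def white_tophat(f: List[int], mu: int) -> List[int]:
    """Connected tophat on bright structures: f minus its opening by reconstruction."""
    g = opening_by_reconstruction(f, mu)
    return [fx - gx for fx, gx in zip(f, g)]

f = [0, 0, 5, 5, 0, 0, 3, 3, 7, 3, 3, 0, 0]
print(opening_by_reconstruction(f, 1))   # [0, 0, 0, 0, 0, 0, 3, 3, 3, 3, 3, 0, 0]
print(white_tophat(f, 1))                # [0, 0, 5, 5, 0, 0, 0, 0, 4, 0, 0, 0, 0]
```

The narrow peak of width 2 is removed entirely, while the plateau of width 5 is restored with its original support; only the bump on top of it is flattened. The closing by reconstruction and the black tophat follow by duality (swap min and max).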



3 Contrast Mappings Based on Tophat Criteria 



In this section, a study of the contrast mappings, using a contrast criterion given
by the tophat transformation, is made. The interest in the use of this type of
criterion consists in knowing strictly the contrast introduced in the output image.
Consider the two-state contrast mappings defined by the following relationships,
where $\tilde{\gamma}_\mu$ and $\tilde{\varphi}_\mu$ denote the opening and the closing by reconstruction:

$$\kappa^{\gamma}_{\mu,\phi}(f)(x) = \begin{cases} \tilde{\gamma}_\mu(f)(x) & \text{if } [f - \tilde{\gamma}_\mu(f)](x) \le \phi \\ f(x) & \text{if } [f - \tilde{\gamma}_\mu(f)](x) > \phi \end{cases}$$

$$\kappa^{\varphi}_{\mu,\phi}(f)(x) = \begin{cases} \tilde{\varphi}_\mu(f)(x) & \text{if } [\tilde{\varphi}_\mu(f) - f](x) \le \phi \\ f(x) & \text{if } [\tilde{\varphi}_\mu(f) - f](x) > \phi \end{cases} \qquad (3)$$

Both operators $\kappa^{\gamma}_{\mu,\phi}$ and $\kappa^{\varphi}_{\mu,\phi}$ are connected; the partitions of the output
images computed by these operators are composed of flat zones of $f$ and other
flat zones merged by $\tilde{\gamma}_\mu$ and $\tilde{\varphi}_\mu$. The first operator works on bright structures,
whereas the second one works on the dark regions. The use of a contrast criterion to
build these operators permits the classification of the points in the domain of
definition of $f$ into two sets: a set $S^1_{\mu,\phi}(f)$ composed of the regions of high contrast,
where for all points $x \in S^1_{\mu,\phi}(f)$,

$[f - \tilde{\gamma}_\mu(f)](x) > \phi$ for $\kappa^{\gamma}_{\mu,\phi}$ and $[\tilde{\varphi}_\mu(f) - f](x) > \phi$ for $\kappa^{\varphi}_{\mu,\phi}$,

and the set $S^0_{\mu,\phi}(f)$ composed of weak contrast zones (the complement of
$S^1_{\mu,\phi}(f)$), where for all points $x \in S^0_{\mu,\phi}(f)$,

$[f - \tilde{\gamma}_\mu(f)](x) \le \phi$ for $\kappa^{\gamma}_{\mu,\phi}$ and $[\tilde{\varphi}_\mu(f) - f](x) \le \phi$ for $\kappa^{\varphi}_{\mu,\phi}$.

Remark 1 In general, the operator $\kappa^{\gamma}_{\mu,\phi}$ using the opening as pattern will be
analyzed. Then, for convenience, the notation $\kappa_{\mu,\phi}$ will be used instead of $\kappa^{\gamma}_{\mu,\phi}$.
When it is required, the type of primitive (opening or closing) will be specified.

Now, let us show how the contrast is modified for obtaining the output image.
By construction, the contrast mapping $\kappa_{\mu,\phi} = \kappa^{\gamma}_{\mu,\phi}$ is an anti-extensive
transformation. Thus, the following inclusion relation is verified: $\tilde{\gamma}_\mu(f) \le \kappa_{\mu,\phi}(f) \le f$.
Then, the operator increases the contrast of a region by attenuating the neighboring
regions with the opening by reconstruction. On the other hand, the contrast
operators based on a tophat criterion not only classify the high and weak contrast
regions ($S^1_{\mu,\phi}(f)$ and $S^0_{\mu,\phi}(f)$) of the input image, but also impose a well
defined contrast on the output image, as expressed by the following property:

Property 1 The output image computed by $\kappa_{\mu,\phi}(f)$ has a well-defined contrast.
For every point $x$ of its domain of definition, the tophat transformation value of
$\kappa_{\mu,\phi}(f)$ satisfies

$[\kappa_{\mu,\phi}(f) - \tilde{\gamma}_\mu(\kappa_{\mu,\phi}(f))](x) > \phi$ for all points $x \in S^1_{\mu,\phi}(f)$ and

$[\kappa_{\mu,\phi}(f) - \tilde{\gamma}_\mu(\kappa_{\mu,\phi}(f))](x) = 0$ for all points $x \in S^0_{\mu,\phi}(f)$.

Now, since $S^1_{\mu,\phi}(\kappa_{\mu,\phi}(f)) = S^1_{\mu,\phi}(f)$, one has:

Property 2 The two-state contrast operator $\kappa_{\mu,\phi}$ is an idempotent transformation;
$\kappa_{\mu,\phi}(\kappa_{\mu,\phi}(f)) = \kappa_{\mu,\phi}(f)$.

Similar results can be expressed for the contrast operator using the closing by
reconstruction and the original function as primitives. Finally, in a composition
of two-state contrast mappings based on the parameter $\phi$, the strongest operator
imposes its effects: $\kappa_{\mu,\phi_j}\kappa_{\mu,\phi_i} = \kappa_{\mu,\max\{\phi_i,\phi_j\}}$.
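
The two-state contrast mapping on bright structures and the properties stated above can be checked numerically on a small 1-D signal. The sketch below is illustrative only: it assumes a flat 1-D structuring element of half-width mu with windows truncated at the borders, and the function names are hypothetical.

```python
from typing import List

def opening_by_reconstruction(f: List[int], mu: int) -> List[int]:
    """Erosion of half-width mu followed by geodesic dilations of size 1 under f."""
    g = [min(f[max(0, i - mu): i + mu + 1]) for i in range(len(f))]
    while True:
        d = [max(g[max(0, i - 1): i + 2]) for i in range(len(g))]
        nxt = [min(fx, dx) for fx, dx in zip(f, d)]
        if nxt == g:
            return g
        g = nxt

def kappa(f: List[int], mu: int, phi: int) -> List[int]:
    """Keep f where the connected tophat exceeds phi, otherwise take the opening."""
    g = opening_by_reconstruction(f, mu)
    return [fx if fx - gx > phi else gx for fx, gx in zip(f, g)]

f = [0, 0, 5, 5, 0, 0, 3, 3, 7, 3, 3, 0, 0]
out = kappa(f, 1, 4)
print(out)                                       # [0, 0, 5, 5, 0, 0, 3, 3, 3, 3, 3, 0, 0]
print(kappa(out, 1, 4) == out)                   # idempotence: True
print(kappa(kappa(f, 1, 3), 1, 4) == kappa(f, 1, 4))   # strongest phi imposes its effects: True
```

Here the peak with tophat value 5 exceeds the threshold and is kept, while the bump with tophat value 4 does not and is replaced by the opening, so the output lies between the opening by reconstruction and the input.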

4 Multiscale Morphological Contrast 

In this section, the different scales (sizes) of the image will be taken into account
for increasing the contrast of the output image. To introduce the scale
parameter, a composition of contrast operators depending on the size parameter
will be applied. In order to generate a multiscale processing method some properties
are needed. Among them, causality and edge preservation are the most
important ones [12]. Causality implies that coarser scales can only be caused by
what happened at finer scales. The derived images contain fewer and fewer details:
some structures are preserved; others are removed from one scale to the next.
Particularly, the transformations should not create new structures at coarser
scales. On the other hand, if the goal of image enhancement is to provide an
image for image segmentation, one requires edge preservation; the contours
must remain sharp and not displaced. It is clear that openings and closings by
reconstruction preserve contours and regional extrema. In fact, they form the
main tools for multiscale morphological image processing. Consider the case of
a composition of two contrast operators, $\kappa_{\mu_2,\phi}\kappa_{\mu_1,\phi}$, defined by the update equation:

$$\kappa_{\mu_2,\phi}\kappa_{\mu_1,\phi}(f)(x) = \begin{cases} \tilde{\gamma}_{\mu_2}(\kappa_{\mu_1,\phi}(f))(x) & \text{if } [\kappa_{\mu_1,\phi}(f) - \tilde{\gamma}_{\mu_2}(\kappa_{\mu_1,\phi}(f))](x) \le \phi \\ \kappa_{\mu_1,\phi}(f)(x) & \text{if } [\kappa_{\mu_1,\phi}(f) - \tilde{\gamma}_{\mu_2}(\kappa_{\mu_1,\phi}(f))](x) > \phi \end{cases} \qquad (4)$$

Figure 1 illustrates an example of a composition of two-state contrast mappings
with parameters $\mu_1 < \mu_2$ and $\phi$. In Fig. 1(b) the opening of size $\mu_1$ of the function
in Fig. 1(a) is illustrated in dark gray color, whereas the regions removed by the
opening are shown in bright gray color. In Fig. 1(c) the output function $\kappa_{\mu_1,\phi}(f)$
is shown. Finally, in Figs. 1(d) and 1(e) the opening $\tilde{\gamma}_{\mu_2}(\kappa_{\mu_1,\phi}(f))$ and the output
image computed by the composition $\kappa_{\mu_2,\phi}\kappa_{\mu_1,\phi}(f)$ are illustrated. Observe that
the high contrast regions of $\kappa_{\mu_1,\phi}(f)$ are not modified by $\tilde{\gamma}_{\mu_2}$, as shown in Fig. 1(e).
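
A composition of contrast mappings over increasing sizes can be sketched by applying the operator scale by scale. The example below is a hedged 1-D illustration, not the authors' code: the flat structuring element of half-width mu, the border handling and the names `kappa` and `compose` are assumptions.

```python
from typing import List, Sequence

def opening_by_reconstruction(f: List[int], mu: int) -> List[int]:
    """Erosion of half-width mu followed by geodesic dilations of size 1 under f."""
    g = [min(f[max(0, i - mu): i + mu + 1]) for i in range(len(f))]
    while True:
        d = [max(g[max(0, i - 1): i + 2]) for i in range(len(g))]
        nxt = [min(fx, dx) for fx, dx in zip(f, d)]
        if nxt == g:
            return g
        g = nxt

def kappa(f: List[int], mu: int, phi: int) -> List[int]:
    """Two-state contrast mapping on bright structures."""
    g = opening_by_reconstruction(f, mu)
    return [fx if fx - gx > phi else gx for fx, gx in zip(f, g)]

def compose(f: List[int], mus: Sequence[int], phi: int) -> List[int]:
    """Apply the contrast mappings from the first size in mus to the last."""
    for mu in mus:
        f = kappa(f, mu, phi)
    return f

# A high-contrast thin peak, a low-contrast medium structure, and a wide
# plateau carrying a low-contrast bump.
f = [0, 9, 0, 0, 2, 2, 2, 0, 0, 6, 6, 6, 7, 6, 6, 6, 0]
print(kappa(f, 1, 3))         # [0, 9, 0, 0, 2, 2, 2, 0, 0, 6, 6, 6, 6, 6, 6, 6, 0]
print(compose(f, [1, 2], 3))  # [0, 9, 0, 0, 0, 0, 0, 0, 0, 6, 6, 6, 6, 6, 6, 6, 0]
```

At the first scale only the low-contrast bump is flattened. At the second scale the low-contrast medium structure is removed, while the thin peak, whose tophat value exceeds the threshold, is left untouched.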
In the general case, when a family of contrast operators $\{\kappa_{\mu_i,\phi}\}$ is applied by
composition, the following property is obtained.

Property 3 The composition of a family of contrast operators $\{\kappa_{\mu_i,\phi}\}$, with
$\mu_1 < \mu_2 < \cdots < \mu_n$, preserves a well-defined contrast at each scale. For a given
$\mu_i$, such that $1 \le i \le n$, and for all points $x \in$

TfJ.ti^nn,4’ ' ' ’ ^A»2,0^Ml,0(/))](^) ^ 4^’ 

and for all points x G 
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ie) 



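Fig. 1. a) Original function, b) Opening by reconstruction $\tilde{\gamma}_{\mu_1}(f)$, c) Output function
$\kappa_{\mu_1,\phi}(f)$, d) Opening by reconstruction $\tilde{\gamma}_{\mu_2}(\kappa_{\mu_1,\phi}(f))$, e) Output function $\kappa_{\mu_2,\phi}\kappa_{\mu_1,\phi}(f)$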



and the structures at scale fit are preserved, ■■■ Kfj, 2 , 4 >^ui, 4 >(f)) ~ 

• • • Kp2.<i>^Aii,<i>(/)) 

For a composition of a family of contrast operators $\{\kappa_{\mu_i,\phi_i}\}$ the structures at
scale $\mu_i$ are preserved under the condition $\phi_1 > \phi_2 > \cdots > \phi_n$. Finally, the
composition of contrast operators increases the contrast at finer scales as expressed
by the following property:

Property 4 In a composition of a family of contrast operators $\{\kappa_{\mu_i,\phi}\}$, with
$\mu_1 < \mu_2 < \cdots < \mu_n$, the following inclusion relation can be established. For a
given Hi such that 1 < i < n, and for all points x € • • • «^/i2,0^Au.0(/))i 

■ ■ ■ ^Al2, 0^/21, <#>(/) ~ "i iH ,(j> ■ ■ ■ ^M2.<i>^Ml><i>(/))](^) 

Figure 2(b) shows the output image $\kappa_{\mu_2,\phi}\kappa_{\mu_1,\phi}(f)$, with $\mu_1 = 64$, $\mu_2 = 96$,
and $\phi = 15$, while Figs. 2(d) and 2(e) illustrate the binary images computed
by a threshold between 10 and 255 gray-levels of the internal gradients of the
original image and the output image $\kappa_{\mu_2,\phi}\kappa_{\mu_1,\phi}(f)$. Figures 2(c) and 2(f) show
the output image $\kappa_{\mu_3,\phi_3}\kappa_{\mu_2,\phi_2}\kappa_{\mu_1,\phi_1}(f)$ with $\mu_1 = 32$, $\mu_2 = 64$, $\mu_3 = 96$, $\phi_1 = 10$,
$\phi_2 = \phi_3 = 15$, and its binary image computed by means of a threshold between
10 and 255 gray-levels of its internal gradient. Finally, Figs. 2(g)-(i) illustrate
the segmented images obtained from the images in Figs. 2(d)-(f). By comparing
Figs. 2(h) and 2(i), observe the structures at scale $\mu_1 = 32$ introduced in the
output image 2(f). For a complete study of image segmentation in mathematical
morphology see [13].

Fig. 2. a) Original image, b), c) Output images $\kappa_{\mu_2,\phi}\kappa_{\mu_1,\phi}(f)$ and $\kappa_{\mu_3,\phi_3}\kappa_{\mu_2,\phi_2}\kappa_{\mu_1,\phi_1}(f)$,
d), e), f) Gradient contours of images (a), (b), (c), and g), h), i) Segmented images

5 Some Improved Multiscale Contrast Algorithms 

The approach described above has one main drawback: it does
not allow the gray-levels of the image to be increased. To attenuate this inconvenience,
some modifications to the multiscale contrast operator proposed above
are studied. On the other hand, when a family of contrast operators is applied
by composition, one begins with the smallest structuring element. Here, one also
illustrates the case of a composition of contrast mappings beginning with the
greatest structuring element.

5.1 Linear Anamorphosis Applied on the Tophat Image 

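Consider the following residue, where $\tilde{\gamma}_\mu$ denotes the opening by reconstruction:
$a_{\mu,\phi}(f)(x) = 0$ if $[f - \tilde{\gamma}_\mu(f)](x) \le \phi$ and $a_{\mu,\phi}(f)(x) =
[f - \tilde{\gamma}_\mu(f)](x)$ if $[f - \tilde{\gamma}_\mu(f)](x) > \phi$. Then, the output image $\kappa_{\mu,\phi}(f)$ can be expressed
in the form: $\kappa_{\mu,\phi}(f) = \tilde{\gamma}_\mu(f) + a_{\mu,\phi}(f)$. Now, let us take the linear anamorphosis:
$\alpha a_{\mu,\phi}(f)$, where $\alpha$ is a positive integer. Then, a new two-state contrast mapping
will be defined by:

$$\kappa^{\alpha}_{\mu,\phi}(f) = \tilde{\gamma}_\mu(f) + \alpha a_{\mu,\phi}(f) \qquad (5)$$

When the parameter $\alpha$ is equal to one, we have $\kappa^{\alpha}_{\mu,\phi} = \kappa_{\mu,\phi}$. Take $\alpha$ as the
minimum integer value that enables us to increase the contrast ($\alpha = 2$). Let us
consider some conditions for this last operator. If the parameter $\phi$ takes the zero
value, the extensive operator $\kappa^{\alpha}_{\mu,\phi}(f) = [f + (f - \tilde{\gamma}_\mu(f))]$ is obtained, and if $\phi$ takes
the maximum value of function $a_{\mu,\phi}(f)$, one has that for all points in the domain
of definition of $f$, $\kappa^{\alpha}_{\mu,\phi}(f) = \tilde{\gamma}_\mu(f)$. Now, for other $\phi$ values, one can define the set of
high contrast points that satisfy: for all points $x \in S^1_{\mu,\phi}(\kappa^{\alpha}_{\mu,\phi}(f))$,
$\kappa^{\alpha}_{\mu,\phi}(f)(x) = [f + a_{\mu,\phi}(f)](x)$ and $\kappa^{\alpha}_{\mu,\phi}(f)(x) - \tilde{\gamma}_\mu(f)(x) = 2a_{\mu,\phi}(f)(x) > \phi$. With
regard to the set of weak contrast points $S^0_{\mu,\phi}(\kappa^{\alpha}_{\mu,\phi}(f))$, one has: for all points
$x \in S^0_{\mu,\phi}(\kappa^{\alpha}_{\mu,\phi}(f))$, $\kappa^{\alpha}_{\mu,\phi}(f)(x) = \tilde{\gamma}_\mu(f)(x)$ and $\kappa^{\alpha}_{\mu,\phi}(f)(x) - \tilde{\gamma}_\mu(f)(x) = 0$. Thus,
the operator $\kappa^{\alpha}_{\mu,\phi}$ doubles the contrast of the output image with respect
to the operator $\kappa_{\mu,\phi}$. If a composition of two contrast operators $\kappa^{\alpha}_{\mu_2,\phi_2}\kappa^{\alpha}_{\mu_1,\phi_1}(f)$
($\mu_1 < \mu_2$) is applied, the parameter $\phi_2$ can take twice the value of $\phi_1$ without
affecting the structures preserved by the first operator. But this is not the only
advantage, because this operator increases the contrast of a region in two
ways: by attenuating the neighboring regions and by increasing its gray-levels.
The images in Fig. 3(b) and 3(c) illustrate the performance of this operator. A
composition of three contrast mappings $\kappa^{\alpha}_{\mu_3,\phi_3}\kappa^{\alpha}_{\mu_2,\phi_2}\kappa^{\alpha}_{\mu_1,\phi_1}(f)$ with $\mu_1 = 8$, $\mu_2 = 16$,
$\mu_3 = 48$, and $\phi_1 = 0$, $\phi_2 = 3$, $\phi_3 = 7$ was applied to the original image in Fig.
3(a) to obtain the image in Fig. 3(b), whereas the composition $\kappa^{\alpha}_{\mu_1,\phi_1}\kappa^{\alpha}_{\mu_2,\phi_2}\kappa^{\alpha}_{\mu_3,\phi_3}(f)$
with the same parameters was used to compute the image in Fig. 3(c).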
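
The effect of the linear anamorphosis can be sketched on a 1-D signal: the residue above the threshold is scaled by a factor alpha and added back to the opening by reconstruction. This is an illustrative sketch under assumptions (flat 1-D structuring element of half-width mu, no clipping of the output gray-levels, hypothetical function names), not the authors' implementation.

```python
from typing import List

def opening_by_reconstruction(f: List[int], mu: int) -> List[int]:
    """Erosion of half-width mu followed by geodesic dilations of size 1 under f."""
    g = [min(f[max(0, i - mu): i + mu + 1]) for i in range(len(f))]
    while True:
        d = [max(g[max(0, i - 1): i + 2]) for i in range(len(g))]
        nxt = [min(fx, dx) for fx, dx in zip(f, d)]
        if nxt == g:
            return g
        g = nxt

def kappa_alpha(f: List[int], mu: int, phi: int, alpha: int = 2) -> List[int]:
    """Opening by reconstruction plus alpha times the thresholded tophat residue."""
    g = opening_by_reconstruction(f, mu)
    residue = [fx - gx if fx - gx > phi else 0 for fx, gx in zip(f, g)]
    return [gx + alpha * ax for gx, ax in zip(g, residue)]

f = [0, 0, 5, 5, 0, 0, 3, 3, 7, 3, 3, 0, 0]
print(kappa_alpha(f, 1, 4, alpha=1))   # [0, 0, 5, 5, 0, 0, 3, 3, 3, 3, 3, 0, 0]
print(kappa_alpha(f, 1, 4, alpha=2))   # [0, 0, 10, 10, 0, 0, 3, 3, 3, 3, 3, 0, 0]
print(kappa_alpha(f, 1, 0, alpha=2))   # [0, 0, 10, 10, 0, 0, 3, 3, 11, 3, 3, 0, 0]
```

With alpha equal to one the two-state mapping is recovered; with alpha equal to two and a zero threshold the output equals f plus its connected tophat. In practice the result would need to be clipped to the gray-level range of the image, which this sketch does not do.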



5.2 Contrast Operator on Bright and Dark Regions 

All the contrast operators introduced in this paper work separately on bright
or dark regions. Here, the main interest is to introduce a contrast operator that
permits the processing of both regions. Consider the following operators

$$\kappa^{\gamma}_{\mu,\phi}(f)(x) = \tilde{\gamma}_\mu(f)(x) + \alpha a^{\gamma}_{\mu,\phi}(f)(x) \qquad \kappa^{\varphi}_{\mu,\phi}(f)(x) = \tilde{\varphi}_\mu(f)(x) - \alpha a^{\varphi}_{\mu,\phi}(f)(x)$$

These contrast operators will be used as the primitives for building a new
contrast operator. Observe that the symbols $\gamma$ and $\varphi$ are now introduced in the
contrast operators. In order to build such an operator, another criterion to choose
the primitives must be introduced. The natural criterion is the comparison
between the tophat on white regions and that on black regions. Thus, the contrast
operator, working on bright and dark regions, will be given by:

$$\kappa_{\mu_1,\mu_2,\phi}(f)(x) = \begin{cases} \kappa^{\gamma}_{\mu_1,\phi}(f)(x) & \text{if } [\tilde{\varphi}_{\mu_2}(f) - f](x) \le [f - \tilde{\gamma}_{\mu_1}(f)](x) \\ \kappa^{\varphi}_{\mu_2,\phi}(f)(x) & \text{otherwise} \end{cases} \qquad (6)$$

Figure 3(d) shows the output image computed from the image in Fig. 3(a) by
a composition of two contrast mappings of the form given in Eq. (6). The selected sizes
for the primitives were $\mu_1 = 16$, $\mu_2 = 8$ for the first operator and $\mu'_1 = 48$, $\mu'_2 = 16$
for the second one, while the parameter values $\phi_1$, $\phi_2$ were taken equal to zero.
A similar composition of contrast operators was applied on Fig. 3(a) for obtaining
the image in Fig. 3(e), but in this case the parameter values $\phi_1$ and $\phi_2$ were
selected as $\phi_1 = 0$ and $\phi_2 = 7$. Compare this last image with that in Fig. 3(d),
and observe how the contrast is increased when the gray-level of some regions is
attenuated by the opening or closing. Finally, Fig. 3(f) illustrates the output image
computed by a further composition of these contrast operators, with the same
parameter values used to compute the image in Fig. 3(e).

Fig. 3. a) Original image $f$, b), c) Output images $\kappa^{\alpha}_{\mu_3,\phi_3}\kappa^{\alpha}_{\mu_2,\phi_2}\kappa^{\alpha}_{\mu_1,\phi_1}(f)$ and
$\kappa^{\alpha}_{\mu_1,\phi_1}\kappa^{\alpha}_{\mu_2,\phi_2}\kappa^{\alpha}_{\mu_3,\phi_3}(f)$ with parameters $\mu_1 = 8$, $\mu_2 = 16$, $\mu_3 = 48$, and
$\phi_1 = 0$, $\phi_2 = 3$, $\phi_3 = 7$, d), e) Output images with parameters $\mu_1 = 16$, $\mu_2 = 8$,
$\mu'_1 = 48$, $\mu'_2 = 16$ using $\phi_1 = \phi_2 = 0$ for (d) and $\phi_1 = 0$, $\phi_2 = 7$ for (e), f) Output
image with the same parameters as (e).



6 Conclusion and Future Works 

In this work, a multiscale connected approach for contrast enhancement and 
segmentation based on connected contrast mappings has been proposed. For 
building the contrast operators, a contrast criterion given by the tophat trans- 
formation was used. This type of criterion permits the building of new contrast 
operators which will enable us to obtain images with a well-defined contrast. 
When applying by composition a family of contrast operators depending on a 
size parameter, a multiscale algorithm for image enhancement was generated. 
The output image computed by a composition of contrast operators preserves 
a well-defined contrast at each scale of the family. Finally, the use of anamor- 
phoses was introduced to propose some improved multiscale algorithms. Future 
works on the multiscale contrast approach will be in the direction to extend the 
approach to the morphological hat-transform scale spaces proposed in [14]. 
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Abstract. In this paper, we present the development of an active object 
recognition system. Our system uses a mutual information framework in
order to choose an optimal sensor configuration for recognizing an unknown
object. The system builds a database of conditional probability density functions
for some observed features over a discrete set of sensor configurations
for a set of interesting objects. Using a sequential decision
making process, our system determines an optimal action (sensor configuration)
that augments discrimination between objects in our database.
We iterate this procedure until a decision about the class of the unknown
object can be made. Actions include pan, tilt and zoom values for an active
camera. Features include the color patch mean over a region in our
image. We have tested the system on a set composed of 8 different soda bottles and
we have obtained a recognition rate of about 95%. On average, a sequence
of 4 actions was needed for a decision to be made.



1 Introduction 

Object recognition is a very important task in robot navigation because it is one
of the aspects that enable a system to behave autonomously. This capability is
essential when identifying a pre-planned path or when avoiding an obstacle [1].

An active sensor is very useful in robot navigation because it enables the robot
to look for objects known to be in its proximity without needing to change its
path. Active sensing tasks have been studied by Mihaylova [2].

Some recent works use mutual information frameworks to determine the
best actions to command an active camera in order to recognize some objects
[3][4]. Some others deal with sensor control for path servoing in robot navigation
[5][6][7].

In this paper we develop a system to actively recognize a set of objects by
choosing a sequence of actions for an active camera that helps to discriminate
between the objects in a learned database. The rest of this paper is organized as
follows: In section 2, we state our problem formulation using mutual information.
A description of our system implementation is given in section 3. In section 4,
we present our tests and results. Finally, we present our conclusion and the work
to be done in the future in section 5.

2 Problem Formulation 

An active sensor can be very useful when recognizing an object. Denzler[3] [4] 
has proposed to use an information theoretic approach for the object recognition 
problem. His approach consists in using a smart sensor and to choose succes- 
sive configuration steps in order to discriminate between objects in a learned 
database. 

Uncertainty and ambiguity between objects are reduced by a camera action 
that maximizes mutual information. The key point of his work is the integration 
of the probabilistic models for active sensors with the effect of the different 
actions available. 

Let us define $x_t$ as the estimated state for recognition of classes $\Omega_k$,
$k \in \{1, \ldots, n\}$, at iteration $t$. At each step, we are interested in
computing the true state given an observation $o_t$. According to the information
theory framework, optimal estimation is given by an action $a_t$ that optimizes
mutual information. Mutual information is defined as

$$I(x_t; o_t \mid a_t) = H(x_t) - H(x_t \mid o_t, a_t) \qquad (1)$$

where $H(\cdot)$ denotes the entropy of a probability distribution. Considering

$$H(x_t) = -\int_{x_t} p(x_t) \log p(x_t)\, dx_t \qquad (2)$$

and

$$I(x_t; o_t \mid a_t) = \int_{x_t} \int_{o_t} p(x_t)\, p(o_t \mid x_t, a_t) \log \frac{p(o_t \mid x_t, a_t)}{p(o_t \mid a_t)}\, do_t\, dx_t \qquad (3)$$

the optimal action $a_t^*$ that maximizes mutual information is given by

$$a_t^* = \arg\max_{a_t} I(x_t; o_t \mid a_t). \qquad (4)$$



3 System Implementation 

We can divide the implementation of the active object recognition system into two
parts: the learning phase and the recognition phase.



3.1 Learning Phase 

The objective of the learning phase is to create a database of conditional
probability density functions linking actions in the configuration space of the
active sensor with the objects in our database.

We consider as actions the pan, tilt and zoom configuration values of an active
camera. We divide the full range of every configurable action into a set of discrete
values, that is, every action $a_t$ is defined as:

$$a_t = (p_k, t_k, z_k) \qquad (5)$$

where $p_k$, $k \in \{1, \ldots, n_p\}$, is a pan value, $t_k$, $k \in \{1, \ldots, n_t\}$,
is a tilt value and $z_k$, $k \in \{1, \ldots, n_z\}$, is a zoom value, with $n_p$, $n_t$,
$n_z$ being respectively the number of discrete steps for the ranges of pan, tilt
and zoom values.

We can use different properties of an object to characterize it. Among these
properties we can find the chromatic or grayscale intensity, size, shape, edges, etc.
In this work, we considered the chromatic intensity of the object as the feature
to use for active object recognition. Chromatic intensities for the objects in our
database were modelled using Gaussian probability density functions.

During the learning phase, we compute an RGB intensity mean vector using
Equation 6. These values characterize the objects in our database at a given
sensor configuration. We assume a normal distribution of intensities under
illumination variations. To obtain its parameters we compute the mean and
variance over several runs under the same sensor configuration.

$$I_p = \frac{1}{n} \sum_{i=1}^{n} \begin{pmatrix} I_r(i) \\ I_g(i) \\ I_b(i) \end{pmatrix} \qquad (6)$$

In Equation 6, $I_p$ represents the mean intensity for each image, $I_r(i)$, $I_g(i)$
and $I_b(i)$ represent the red, green and blue intensities of pixel $i$ of that image,
and $n$ is the total number of pixels of the image. With these values we can compute
the probability of observing some characteristic of the object when the sensor
shows a given configuration state.

We then have the conditional probability of observing a feature $c_t$ given
some action $a_t$ as:

$$P(c_t \mid a_t) = \int_{x_t} P(c_t \mid x_t, a_t)\, P(x_t)\, dx_t. \qquad (7)$$

We used 8 different objects with similar properties for our database. These
objects are shown in Figure 1. We obtained 30 different images for each object.
The process was repeated 8 times, each time with the object in a different position.
This procedure lets us capture an object model that includes distinctive properties.
As we have taken images of each object from different viewpoints, we can
recognize objects even if they show a different aspect from the learned ones.

In Figure 2, we can observe the graphs of the RGB mean intensities
for each object over a range of pan values at a fixed setup for tilt and zoom
values. These features let us discriminate the correct class for the unknown
item among all the objects in our database.



3.2 Recognition Phase 

The objective of this phase is active object recognition, which is the main
part of this work. When an unknown object from our database is presented to
the system, a sequential process is started. We assume equal a priori probability
of the unknown object belonging to each class $\Omega_k$.

Fig. 1. Images of the objects in our database.

Fig. 2. Red, green and blue mean intensities for the color patches of the objects over
a range of pan values.

Firstly, mutual information is computed as follows:

$$I_0(\Omega; c \mid a) = \sum_{k=1}^{K} e_k(a)\, P_k \qquad (8)$$

where entropies are defined as:

$$e_k(a) = \sum_{c_i} P(c_i \mid \Omega_k, a) \log \frac{P(c_i \mid \Omega_k, a)}{P(c_i \mid a)} \qquad (9)$$

Mutual information provides the best match between the current estimated
state and the observation made at this step. We then look for the action
$a_0^*$ that maximizes mutual information:

$$a_0^* = \arg\max_{a} I_0(\Omega; c \mid a) \qquad (10)$$



We need to execute this action on the sensor and to update the a priori
probabilities $P_k$ for each possible class. This update reinforces the
probabilities of the classes that remain plausible (and possibly ambiguous) for
the unknown object and, on the other hand, weakens the probabilities of the
unknown object belonging to dissimilar classes.

$$P_k = \frac{P(c_0 \mid \Omega_k, a_0)\, P_k}{P(c_0 \mid a_0)} \qquad (11)$$



We iterate this procedure in a sequential way until the probability of the most 
probable class to which the unknown object belongs exceeds a given certainty 
threshold. 
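
The sequential procedure of this section can be illustrated with a minimal Python sketch. It assumes discrete feature values and a small hand-made table `likelihood[a][k][i]` standing for P(c_i | class k, action a); the table, the function names and the `observe` callback are illustrative assumptions and not part of the system described here, which models color features with Gaussian densities.

```python
import math

def mutual_information(prior, likelihood_a):
    """Sum over classes k of P_k * sum_i P(c_i|k,a) * log(P(c_i|k,a) / P(c_i|a))."""
    n_features = len(likelihood_a[0])
    # Marginal P(c_i | a), obtained by summing over the classes.
    marginal = [sum(p_k * row[i] for p_k, row in zip(prior, likelihood_a))
                for i in range(n_features)]
    total = 0.0
    for p_k, row in zip(prior, likelihood_a):
        for i, p in enumerate(row):
            if p > 0.0 and marginal[i] > 0.0:
                total += p_k * p * math.log(p / marginal[i])
    return total

def bayes_update(prior, likelihood_a, feature):
    """Posterior over classes after observing `feature` under the chosen action."""
    unnormalized = [p_k * row[feature] for p_k, row in zip(prior, likelihood_a)]
    norm = sum(unnormalized)
    return [u / norm for u in unnormalized]

def recognize(likelihood, observe, threshold=0.95, max_steps=20):
    """Choose actions by maximal mutual information until one class is certain."""
    n_classes = len(likelihood[0])
    prior = [1.0 / n_classes] * n_classes      # equal a priori probabilities
    steps = 0
    for steps in range(1, max_steps + 1):
        best = max(range(len(likelihood)),
                   key=lambda a: mutual_information(prior, likelihood[a]))
        feature = observe(best)                # execute the action on the sensor
        prior = bayes_update(prior, likelihood[best], feature)
        if max(prior) >= threshold:
            break
    return prior.index(max(prior)), steps, prior

# Hypothetical table: 2 actions, 3 classes, 2 feature values.
TOY_LIKELIHOOD = [
    [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]],      # action 0 tells the classes apart poorly
    [[0.9, 0.1], [0.1, 0.9], [0.5, 0.5]],      # action 1 is informative
]
label, steps, posterior = recognize(TOY_LIKELIHOOD, observe=lambda action: 0)
```

In this toy run the simulated sensor always reports feature value 0, so the probability of class 0 grows at every step until it passes the certainty threshold.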



4 Test and Results 



We present tests and results of our implementation in this section.

We present the temporal evolution of the a posteriori probabilities for each class
when an object of type 2 was presented to our system as an unknown object. The
graphs in Figure 4 show the results of the sequential decision process.
In this case, our system correctly identifies the object under test. We can also see
the number of iterations needed by the system to identify the object.
We start from an equal-probability hypothesis for all the object classes; mutual
information lets us choose actions that make our decision process converge to
a scenario where only one class is correct as a label for the unknown object.
In Figure 3, we show another run of the system with a different object. In this
graph, we can see that the number of iterations needed to decide on a label for the
unknown object depends on the actual view of the object and the initial setting for
the camera action. The number of iterations needed to choose a label differs
between the two cases and depends on the unknown object.

A recognition rate of 95% was achieved under the following conditions:
illumination should be similar during both the learning and recognition phases,
and the images should be acquired at the same distance between the
camera and the object, although the object may be rotated around the vertical
symmetry axis of the bottles with respect to its original position.

Fig. 3. Recognition of object 5.



Fig. 4. Recognition of object 2. 



5 Conclusions and Perspectives 

We have presented a system that can perform an active object recognition task.
It achieves a good recognition rate (95%), although it is sensitive to illumination
changes. The image acquired of the unknown object has not been learnt in
advance: we acquire an image of one of the objects in our database, but not
necessarily under the same environmental conditions as in the learning phase.
Nevertheless, our system is robust to rotation around the
vertical symmetry axis.

If we used more robust features that do not depend on illumination
variability, the recognition rate of our system could be improved.

Our future work will be directed towards:

Increasing the number of features taken into account, so that objects can be
recognized more reliably.

Combining the active sensor with some kind of active handler (like a
turntable). An action will then be composed of the state of the active sensor
augmented by the state of the turntable. This setup will help in the recognition
phase because we could choose view-action pairs where the object can be
recognized more easily.

Modelling the conditional probability functions of normal distributions by
using fuzzy logic rules. This will save memory because we will store
probability density functions as a set of fuzzy rules. It will also enable us
to implement nonparametric conditional probability density functions for the
object-action pairs.

Abstract. Image metamorphosis, commonly known as morphing, is a powerful
tool for visual effects that consists of the fluid transformation of one digital 
image into another. There are many techniques for image metamorphosis, but in 
all of them there is a need for a person to supply the correspondence between 
the features in the source image and target image. In this paper we describe a 
method to perform the metamorphosis of face images in frontal view with 
uniform illumination automatically, using a generic model of a face and 
evolution strategies to find the features in both face images. 



1 Introduction 

Image metamorphosis is a powerful tool for visual effects that consists of the fluid 
transformation of one digital image into another. This process, commonly known as 
morphing [1], has received much attention in recent years. This technique is used for 
visual effects in films and television [2, 3], and it is also used for recognition of faces 
and objects [4]. 

Image metamorphosis is performed by coupling image warping with color 
interpolation. Image warping applies 2D geometric transformations to images to 
retain geometric alignment between their features, while color interpolation blends 
their colors. 

The quality of a morphing sequence depends on the solution of three problems: 
feature specification, warp generation and transition control. Feature specification is 
performed by a person who chooses the correspondence between pairs of feature 
primitives. In current morphing algorithms, meshes [3, 5, 6], line segments [7, 8, 9], or
points [10, 11, 12] are used to determine feature positions in the images. Each 
primitive specifies an image feature, or landmark. Feature correspondence is then 
used to compute mapping functions that define the spatial relationship between all 
points in both images. These mapping functions are known as warp functions and are 
used to interpolate the positions of the features across the morph sequence. Once both 
images have been warped into alignment for intermediate feature positions, ordinary 
color interpolation (cross-dissolve) is performed to generate image morphing. 



R. Monroy et al. (Eds.): MICAI 2004, LNAI 2972, pp. 679-687, 2004. 
© Springer-Verlag Berlin Heidelberg 2004 







V. Zanella and O. Puentes 



Transition control determines the rate of warping and color blending across the morph 
sequence. 

Feature specification is the most tedious aspect of morphing, since it requires a 
person to determine the landmarks in the images. A way to determine the landmarks 
automatically, without the participation of a human, would be desirable. In this sense, 
evolution strategies could be an efficient tool to solve this problem. 

In this work, we use evolutionary strategies and a generic model of the face to find
the facial features and the spatial relationship between all points in both images,
without the intervention of a human expert. We initially chose to work with images of
faces in frontal view with uniform illumination and without glasses and facial hair to
simplify the problem.



2 Feature Specification 

Many methods have been developed to extract facial features. Most of them are based 
on neural nets [13], geometrical features of images [14], and template matching 
methods [15, 16]. 

Unfortunately, most methods require significant human participation; for example,
in [14] a total of 1000 faces were measured, and the locations of 40 points on each
face were recorded, to build the training set of faces. In this work we do not need a
training set of faces to find the model; instead, we use a model of 73 points based on a
simple parameterized face model [17] (Figure 1). The model does not rely on color or
texture; it only uses information about the geometrical relationship among the
elements of the face. For example, we use the fact that the eyes are always at the same
level and above the mouth in a face in frontal view.

The components of the face model that we used are the eyes, eyebrows, nose,
mouth, forehead and chin.




Fig. 1. The Face Model 

2.1 Evolution Strategies



Evolutionary strategies are algorithms based on Darwin’s theory of natural evolution,
which states that only the best individuals survive and reproduce. The procedure starts
by randomly choosing a number of possible solutions within the search space in order
to generate an initial population. Based on a fitness function calculated for each
solution, the best members of the population are allowed to take part in reproduction;
this procedure is called selection. Genetic operations are performed on individuals
with the idea that evolved solutions should combine promising structures from their
ancestors to produce an improved population. Usually two types of genetic operations
are utilized: crossover and mutation. Crossover is the combination of information from
two or more individuals, and mutation is the modification of information from an
individual.

Our algorithm is a (1+1)-ES algorithm [18], i.e. the initial population is formed
by only one individual (a face model), and mutation is the only operation utilized.

The form of the individual is $I = ((x_1, y_1), (x_2, y_2), \ldots, (x_{73}, y_{73}))$,
which corresponds to the 73 points in the model. The mutation operation corresponds
to an affine transformation with parameters $s_x$, $s_y$, $t_x$, $t_y$, where $s_x$ and
$s_y$ are the scale parameters in x and y respectively, and $t_x$ and $t_y$ are the
translation parameters in x and y. Rotation is not used in this case because we assume
that the images are in frontal view and not rotated.

The mutation operation consists of modifying the translation and scale parameters.
It adds normal random numbers with mean $\mu$ and standard deviation $\sigma$,
$N(\mu, \sigma)$, in the following way:

$$t_x' = t_x + N(0, 1) \qquad (1)$$

$$t_y' = t_y + N(0, 1) \qquad (2)$$

$$s_x' = s_x + N(1, 0.5) \qquad (3)$$

$$s_y' = s_y + N(1, 0.5) \qquad (4)$$



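Scale and translation are performed using the following matrices:

$$S = \begin{pmatrix} s_x & 0 & 0 \\ 0 & s_y & 0 \\ 0 & 0 & 1 \end{pmatrix} \qquad (5)$$

and

$$T = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ t_x & t_y & 1 \end{pmatrix} \qquad (6)$$

In this way, we perform the following operations for each point in the model:
$(x', y', 1) = (x, y, 1) \cdot S$ for the scale and $(x', y', 1) = (x, y, 1) \cdot T$ for the translation.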
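
As a rough illustration of how a (1+1)-ES of this kind can be organized, the following Python sketch mutates the four affine parameters and keeps the child only when its fitness is not worse than the parent's. The function names, the zero-mean noise with configurable deviations, and the toy fitness (negative squared distance to a known target) are assumptions made for the example; the method described here scores an individual against the binary and contour images instead.

```python
import random

def apply_affine(points, sx, sy, tx, ty):
    """Scale each model point by (sx, sy), then translate it by (tx, ty)."""
    return [(sx * x + tx, sy * y + ty) for x, y in points]

def mutate(params, rng, scale_sigma, shift_sigma):
    """Add Gaussian noise to the scale and translation parameters."""
    sx, sy, tx, ty = params
    return (sx + rng.gauss(0.0, scale_sigma),
            sy + rng.gauss(0.0, scale_sigma),
            tx + rng.gauss(0.0, shift_sigma),
            ty + rng.gauss(0.0, shift_sigma))

def one_plus_one_es(model, fitness, iterations=500, seed=0,
                    scale_sigma=0.05, shift_sigma=1.0):
    """(1+1)-ES: one parent, one mutated child, keep the better of the two."""
    rng = random.Random(seed)
    parent = (1.0, 1.0, 0.0, 0.0)              # identity transform
    parent_fit = fitness(apply_affine(model, *parent))
    for _ in range(iterations):
        child = mutate(parent, rng, scale_sigma, shift_sigma)
        child_fit = fitness(apply_affine(model, *child))
        if child_fit >= parent_fit:            # maximization
            parent, parent_fit = child, child_fit
    return parent, parent_fit

# Toy problem (hypothetical): recover a known scale and translation.
model = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
target = apply_affine(model, 2.0, 3.0, 5.0, -4.0)

def toy_fitness(points):
    return -sum((x - u) ** 2 + (y - v) ** 2
                for (x, y), (u, v) in zip(points, target))

best_params, best_fit = one_plus_one_es(model, toy_fitness)
```

Because the parent is replaced only by a child that is at least as fit, the fitness of the surviving individual never decreases over the run.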

For the fitness function we need to find the binary image and the contour image
corresponding to the source and target images. We use the fact that the images have
uniform illumination; in this case most of the points in the image depict skin, so
we can find the regions of the face that are darker than skin, for example the eyes or
mouth. We segment the image using a threshold near the mean intensity of the pixels
representing skin. For this, we first convert the original image into a grayscale image
$I$; the threshold $Th$ is then calculated from the image histogram as follows:

$$Th = \frac{\sum_{i=0}^{255} h(i) \cdot i}{\sum_{i=0}^{255} h(i)} \qquad (7)$$

where $h(i)$ corresponds to the number of points with gray intensity $i$ in the histogram
of the image. The value of $Th$ corresponds to the average gray intensity of
the image. With $Th$ we find the eye and mouth regions as the points with values less
than $Th$, which yields a binary image.
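
The thresholding step can be sketched in a few lines of Python. The sketch assumes an 8-bit grayscale image stored as a list of rows and takes the threshold to be the histogram mean, as the text describes; the function names and the tiny test image are illustrative only.

```python
def gray_histogram(gray):
    """Count how many pixels have each gray level 0..255."""
    hist = [0] * 256
    for row in gray:
        for px in row:
            hist[px] += 1
    return hist

def mean_gray_threshold(hist):
    """Histogram mean: sum of h(i) * i divided by the number of pixels."""
    total = sum(hist)
    if total == 0:
        raise ValueError("empty histogram")
    return sum(i * h for i, h in enumerate(hist)) / total

def binarize(gray, threshold):
    """Mark as 1 the pixels darker than the threshold (eyes, mouth), 0 otherwise."""
    return [[1 if px < threshold else 0 for px in row] for row in gray]

# Hypothetical 3x3 image: bright "skin" with one dark pixel in the centre.
image = [[200, 200, 200],
         [200,  40, 200],
         [200, 200, 200]]
th = mean_gray_threshold(gray_histogram(image))
binary = binarize(image, th)
```

Here the only pixel darker than the mean is the centre one, so it is the only pixel set in the binary image.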





(a) (b) 

Fig. 2. (a) Contours image (b) Binary image 



Once we have the binary image and the contour image $\varphi(I)$, obtained with the Canny
operator (Figure 2), we compute the following function:



Fitness = ) . ( 8 ) 



Where 

y.. = ■ (9) 

^El ^Er 



and 

Kau.,= \\m))dl- ( 10 ) 

The symbols 7?^ , and correspond to the left eye, right eye and mouth regions 
respectively, (see Figure 3). 

After the Fitness is found, we adjust the chin, forehead and the eyebrows, so that:

5 = ) ( 11 ) 

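where

$$NP_{eyebrows} = \int_{EB_l} \varphi(I)\, ds + \int_{EB_r} \varphi(I)\, ds \qquad (12)$$

$$NP_{chin,forehead} = \int_{B_{FH}} \varphi(I)\, ds + \int_{B_C} \varphi(I)\, ds \qquad (13)$$

Here $EB_l$, $EB_r$, $B_{FH}$ and $B_C$ are the left eyebrow, right eyebrow, forehead and chin
edges, respectively, and $\varphi(I)$ is the contour image of image $I$ (Figure 3).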

This process is performed with the source image and the target image. The results 
are shown in Figures 4 and 5. 




Fig. 3. Regions and edges of the face model

Fig. 4. The best individual adjusted to the segmented image

Fig. 5. (a) First individual, (b) Intermediate individual and (c) The best individual

3 Warp Generation

Once the model has been adjusted to the images, the next step is to perform image 
deformation, or warping, by mapping each feature in the source image to its 
corresponding feature in the target image. When we use point-based feature 
specification, we must deal with the problem of scattered data interpolation. 

The problem of scattered data interpolation is to find a real-valued multivariate
function interpolating a finite set of irregularly located data points [19]. For bivariate
functions, this can be formulated as:

Input. $n$ data points $(x_i, y_i)$, $x_i \in \mathbb{R}^2$, $y_i \in \mathbb{R}$, $i = 1, \ldots, n$.

Output. A continuous function $f: \mathbb{R}^2 \to \mathbb{R}$ interpolating the given data
points, i.e. $f(x_i) = y_i$, $i = 1, \ldots, n$.

In this work we use the inverse distance weighted interpolation method. 



3.1 Inverse Distance Weighted Interpolation Method 

In the inverse distance weighted interpolation method [19], for each data point $p_i$, a
local approximation $f_i(p): \mathbb{R}^2 \to \mathbb{R}$ with $f_i(p_i) = y_i$, $i = 1, \ldots, n$,
is determined. The interpolation function is a weighted average of these local
approximations, with weights dependent on the distance from the observed point to
the given points,

$$f(p) = \sum_{i=1}^{n} w_i(p)\, f_i(p) \qquad (14)$$

where $w_i: \mathbb{R}^2 \to \mathbb{R}$ is the weight function:

$$w_i(p) = \frac{\sigma_i(p)}{\sum_{j=1}^{n} \sigma_j(p)} \qquad (15)$$

with

$$\sigma_i(p) = \frac{1}{\lVert p - p_i \rVert^{\mu}} \qquad (16)$$

The exponent $\mu$ controls the smoothness of the interpolation.
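
A minimal Python sketch of inverse distance weighting follows. It makes the simplifying assumption that every local approximation is the constant data value at its point (the classic Shepard form); the function name and the sample data are illustrative and not taken from the paper.

```python
import math

def idw_interpolate(p, points, values, mu=2.0):
    """Weighted average of the data values, weights proportional to 1 / distance**mu."""
    sigmas = []
    for q, y in zip(points, values):
        d = math.hypot(p[0] - q[0], p[1] - q[1])
        if d == 0.0:
            return y                  # exactly on a data point: interpolate it
        sigmas.append(1.0 / d ** mu)
    total = sum(sigmas)
    return sum(s * y for s, y in zip(sigmas, values)) / total

# Hypothetical data: two control points with values 0 and 10.
points = [(0.0, 0.0), (2.0, 0.0)]
values = [0.0, 10.0]
midpoint_value = idw_interpolate((1.0, 0.0), points, values)
```

For warping, one such interpolation would typically be evaluated per coordinate, with the feature displacements as data values; a larger `mu` makes each control point dominate more strongly in its neighbourhood.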



4 Transition Control 

To obtain the transition between the source image and the target image we use linear
interpolation of their attributes. If $I_0$ and $I_1$ are the source and target images, we
generate the sequence of images $I_\lambda$, $\lambda \in [0, 1]$, such that

$$I_\lambda = (1 - \lambda)\, I_0 + \lambda\, I_1 \qquad (17)$$

This method is called cross-dissolve.
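
Assuming the usual linear blend between two aligned images, a cross-dissolve can be sketched in Python as below. The images are plain lists of rows of gray values, and the function name and blend parameter `lam` are illustrative choices.

```python
def cross_dissolve(src, dst, lam):
    """Blend two equally sized images pixel by pixel: (1 - lam) * src + lam * dst."""
    return [[(1.0 - lam) * a + lam * b for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(src, dst)]

# Hypothetical one-row images.
source = [[0, 100]]
target = [[200, 100]]
frames = [cross_dissolve(source, target, step / 4) for step in range(5)]
```

The first frame equals the source, the last equals the target, and the frames in between move linearly from one to the other.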



5 Results 

We tested the method with images of faces in frontal view with uniform illumination,
without glasses and facial hair. The average run time to perform the
metamorphosis is 30 seconds on a 2.0 GHz Pentium IV machine with 128 MB of RAM.

The method finds a satisfactory individual in around 500 iterations. The resulting
models are shown in Figures 6, 8 and 10, while Figures 7, 9 and 11 show the morphing
process between these images.



6 Conclusions 

We developed a method to perform the automatic morphing of face images using
evolutionary strategies and a generic face model. We do not need a training set of
faces to obtain the model because we use a model based on a simple parameterized
face model. The results are good, although we worked with a simplified problem using
only images of faces in frontal view and with uniform illumination. As future work,
we plan to generalize the method to work with images with non-uniform illumination
or with rotated face images, using, for example, symmetry-based algorithms [20, 21]
to supply more information about the position of the face.




Fig. 6. Final individuals over the source and target images 




Fig. 7. Resulting Face Image Morphing for images in Figure 6 

Fig. 8. Example 2







Fig. 9. Resulting Face Image Morphing for images in Figure 8 




Fig. 10. Example 3 














Fig. 11. Resulting Face Image Morphing for images in Figure 10 

Abstract. In this paper, we present a parallel version of a multi-objective
evolutionary algorithm that incorporates some coevolutionary concepts. Such an
algorithm was previously developed by the authors. Two approaches were adopted to
parallelize our algorithm (both of them based on a master-slave scheme): one uses
Pthreads (shared memory) and the other one uses MPI (distributed memory). We
conduct a small comparative study to analyze the impact that the parallelization
has on performance. Our results indicate that both parallel versions produce
important improvements in the execution times of the algorithm (with respect to the
serial version) while keeping the quality of the results obtained.



1 Introduction 

The use of coevolutionary mechanisms has been scarce in the evolutionary
multiobjective optimization literature [1]. Coevolution has strong links with game
theory and its suitability for the generation of “trade-offs” (which is the basis for
multiobjective optimization) is, therefore, rather obvious. This paper extends our
proposal for a coevolutionary multi-objective optimization approach presented in [2].
The main idea of our coevolutionary multi-objective algorithm is to obtain information
along the evolutionary process so as to subdivide the search space into n subregions,
and then to use a subpopulation for each of these subregions. At each generation, these
different subpopulations (which evolve independently using Fonseca & Fleming’s
ranking scheme [3]) “cooperate” and “compete” among themselves, and from these
different processes we obtain a single Pareto front. The size of each subpopulation is
adjusted based on its contribution to the current Pareto front (i.e., subpopulations
which contributed more are allowed a larger population size and vice versa). The
approach uses the adaptive grid proposed in [4] to store the nondominated vectors
obtained along the evolutionary process, enforcing a more uniform distribution of such
vectors along the Pareto front.

This paper presents the first attempt to parallelize a coevolutionary multi-objective
optimization algorithm. The main motivation for such parallelization is that the
proposed algorithm is intended for real-world applications (mainly in engineering) and,
therefore, the availability of a more efficient version of the algorithm (in terms of CPU
time required) is desirable. In this paper, we compare the serial version of our algorithm
(as reported in [2]) with two parallel versions (one that uses Pthreads and
another one that uses MPI). A comparison with respect to PAES [4] is also included to
give a general idea of the performance of the serial version of our algorithm with respect
to other approaches. However, for a more detailed comparative study the reader should
refer to [2]. The main aim of this study is to compare the performance gains obtained by
the parallelization of the algorithm. Such performance is measured both in terms of the
computational times required and in terms of the quality of the results obtained.

2 Statement of the Problem 



We are interested in solving problems of the type:

$$\text{minimize } [f_1(x), f_2(x), \ldots, f_k(x)] \qquad (1)$$

subject to:

$$g_i(x) \geq 0, \quad i = 1, 2, \ldots, m \qquad (2)$$

$$h_i(x) = 0, \quad i = 1, 2, \ldots, p \qquad (3)$$

where $k$ is the number of objective functions $f_i: \mathbb{R}^n \to \mathbb{R}$. We call
$x = [x_1, x_2, \ldots, x_n]^T$ the vector of decision variables. We thus wish to determine,
from the set $\mathcal{F}$ of all the vectors that satisfy (2) and (3), the values
$x_1^*, \ldots, x_n^*$ that are Pareto optimal. We say that a vector of decision variables
$x^* \in \mathcal{F}$ is Pareto optimal if there does not exist another $x \in \mathcal{F}$
such that $f_i(x) \leq f_i(x^*)$ for every $i = 1, \ldots, k$ and $f_j(x) < f_j(x^*)$ for at
least one $j$. The vectors $x^*$ corresponding to the solutions included in the Pareto
optimal set are called nondominated. The objective function values corresponding to
the elements of the Pareto optimal set are called the Pareto front of the problem.
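
The dominance relation behind this definition can be written down directly. The following Python sketch, for minimization and with illustrative function names and sample vectors, filters a list of objective vectors down to the nondominated ones.

```python
def dominates(u, v):
    """True if u is no worse than v in every objective and strictly better in one."""
    return (all(a <= b for a, b in zip(u, v))
            and any(a < b for a, b in zip(u, v)))

def nondominated(vectors):
    """Keep the objective vectors that no other vector in the list dominates."""
    return [u for u in vectors
            if not any(dominates(v, u) for v in vectors)]

# Hypothetical objective vectors (f1, f2), both to be minimized.
candidates = [(1, 3), (2, 2), (3, 1), (3, 3)]
front = nondominated(candidates)
```

In this example (3, 3) is dominated by (2, 2), while the other three vectors are mutually incomparable and form the front. The quadratic scan is adequate for small sets; larger archives usually rely on a structure such as the adaptive grid mentioned in the text.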

3 Coevolution 

Coevolution refers to a reciprocal evolutionary change between species that interact with 
each other. The relationships between the populations of two different species can be 
described considering all their possible types of interactions. Such interaction can be 
positive or negative depending on the consequences that such interaction produces on 
the population. Evolutionary computation researchers have developed several coevolu- 
tionary approaches in which normally two or more species relate to each other using 
any of the possible relationships, mainly competitive (e.g., [5]) or cooperative (e.g., [6]) 
relationships. Also, in most cases, such species evolve independently through a genetic 
algorithm. The key issue in these coevolutionary algorithms is that the fitness of an 
individual in a population depends on the individuals of a different population. 

4 Description of the Serial Version of Our Algorithm 

The main idea of our approach is to try to focus the search efforts only towards the
promising regions of the search space. In order to determine what regions of the search
space are promising, our algorithm performs a relatively simple analysis of the current
Pareto front. The evolutionary process of our approach is divided into 4 stages. Our
current version equally divides the full evolutionary run into four parts (i.e., the total
number of generations is divided by four), and each stage is allocated one of these four
parts.

     1. gen = 0
     2. populations = 1
     3. while (gen < Gmax) {
     4.     if (gen = Gmax/4 or Gmax/2 or 3*Gmax/4) {
     5.         check_active_populations()
     6.         decision_variables_analysis()     (compute number of subdivisions)
     7.         construct_new_subpopulations()    (update populations)
            }
     8.     for (i = 1; i <= populations; i++)
     9.         if (population i contributes to the current Pareto front)
    10.             evolve_and_compete(i)
    11.     elitism()
    12.     reassign_resources()
    13.     gen++ }

Fig. 1. Pseudocode of our algorithm.

First Stage. During the first stage, the algorithm is allowed to explore all of the search
space, by using a population of individuals which are selected using Fonseca and
Fleming’s Pareto ranking scheme [3]. Additionally, the approach uses the adaptive grid
proposed in [4]. At the end of this first stage, the algorithm analyses the current Pareto
front (stored in the adaptive grid) in order to determine what variables of the problem are
more critical. This analysis consists of looking at the current values of the decision
variables corresponding to the current Pareto front (line 6, Figure 1). This analysis is
performed independently for each decision variable. The idea is to determine if the values
corresponding to a certain variable are distributed along all the allowable interval or if
such values are concentrated on a narrower range. When the whole interval is being used,
the algorithm concludes that keeping the entire interval for that variable is important.
However, if only a narrow portion is being used, then the algorithm will try to identify
portions of the interval that can be discarded from the search process. As a result of this
analysis, the algorithm determines whether or not it is convenient to subdivide the interval
of a certain decision variable (and, in such case, it also determines how many subdivisions
to perform). Each of these different regions will be assigned a different population
(line 7, Figure 1).

Second Stage. When reaching the second stage, the algorithm consists of a certain
number of populations, each looking at a different region of the search space. At each
generation, the evolution of all the populations takes place independently and, later on,
the nondominated elements from each population are sent to the adaptive grid where they
“cooperate” and “compete” in order to form a single Pareto front (line 10, Figure 1).
After this, we count the number of individuals that each of the populations contributed to
the current Pareto front. Our algorithm is elitist (line 11, Figure 1), because after the first
generation of the second stage, all the populations that do not provide any individual to
the current Pareto front are automatically eliminated and the sizes of the other populations
are properly adjusted. Each population is assigned or removed individuals such that its
final size is proportional to its contribution to the current Pareto front. These individuals
to be added or removed are randomly generated/chosen. Thus, populations compete with
each other to get as many extra individuals as possible. Note that it is, however, possible
that the sizes of the populations “converge” to a constant value once their contribution
to the current Pareto front does not change any longer.

Third Stage. During the third stage, we perform a check on the current populations in
order to determine how many (and which) of them can continue (i.e., those populations
which continue contributing individuals to the current Pareto front) (line 5, Figure 1).
Over these (presumably good) populations, we will apply the same process from the
second stage (i.e., they will be further subdivided and more populations will be created
in order to exploit these “promising regions” of the search space). In order to determine
the number of subdivisions that are to be used during the third stage, we repeat the
same analysis as before. The individuals from the “good” populations are kept. All the
good individuals are distributed across the newly generated populations. After the first
generation of the third stage, the elitist process takes place and the size of each population
will be adjusted based on the same criteria as before. Note, however, that we define a
minimum population size and this size is enforced for all populations at the beginning
of the third stage.

Fourth Stage. During this stage, we apply the same procedure as in the third stage in
order to allow a fine-grained search.

Decision Variables Analysis. The mechanism adopted for the decision variables analysis
is very simple. Given a set of values within an interval, we compute both the minimum
average distance of each element with respect to its closest neighbor and the total portion
of the interval that is covered by the individuals contained in the current Pareto front.
Then, only if the set of values covers less than 80% of the total interval does the
algorithm consider it appropriate to divide it. Once the algorithm decides to divide the
interval, the number of divisions is increased (without exceeding a total of 40 divisions
per interval), as explained next. Let us define range^i as the percentage of the total
interval that is occupied by the values of the variable i. Let dmin^i be the minimum
average distance between individuals (with respect to the variable i) and let divisions^i
be the number of divisions to perform in the interval of the variable i:

    if (range^i < 0.8 * interval^i)
        while (dmin^i < 0.1 * interval^i)
            { divisions^i++; interval^i = 0.2 * interval^i }
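
One possible reading of this analysis, for a single decision variable, is sketched below in Python. Several details are assumptions made for the example rather than facts from the paper: the function name, starting from one division, measuring coverage as the span between the smallest and largest value, and averaging each value's distance to its closest neighbour.

```python
def count_divisions(values, lower, upper, max_divisions=40):
    """Number of divisions for one decision variable (illustrative sketch)."""
    if len(values) < 2:
        return 1
    interval = upper - lower
    ordered = sorted(values)
    covered = ordered[-1] - ordered[0]        # portion of the interval in use
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    # Distance from each value to its closest neighbour, then the average.
    nearest = []
    for idx in range(len(ordered)):
        left = gaps[idx - 1] if idx > 0 else float("inf")
        right = gaps[idx] if idx < len(gaps) else float("inf")
        nearest.append(min(left, right))
    avg_nearest = sum(nearest) / len(nearest)

    divisions = 1                             # assumption: start undivided
    if covered < 0.8 * interval:              # values use less than 80% of the range
        while avg_nearest < 0.1 * interval and divisions < max_divisions:
            divisions += 1
            interval = 0.2 * interval
    return divisions

spread_out = count_divisions([0.0, 2.5, 5.0, 7.5, 10.0], 0.0, 10.0)
clustered = count_divisions([4.0, 4.1, 4.2, 4.3], 0.0, 10.0)
```

Values spread over the whole range leave the interval undivided, while tightly clustered values trigger successive divisions until the shrinking interval is comparable to the spacing between values or the cap of 40 divisions is reached.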



Parameters Required. Our proposed approach requires the following parameters:

1. Crossover rate (pc) and mutation rate (pm).

2. Maximum number of generations (Gmax).

3. Size of the initial population (popsize_init) to be used during the first stage and
minimum size of the secondary population (popsize_sec) to be used during the further
stages.
5 Description of the Parallelization Strategy 



The topology adopted in this work consisted of a master-slave scheme. As we indicated 
before, the evolutionary process of our algorithm is divided into four stages. Next, we will 
briefly describe the part of each of these stages that was parallelized. 

First Stage. In the case of Pthreads, this first stage is performed by the master thread. 
In the case of MPI, the corresponding work is performed by each of the slave processes. 
Since the master process is the only one with access to the adaptive grid, upon finishing 
each generation, each slave process must send its full population to the master process. 
The master process receives all the populations and applies the corresponding filters to 
send the nondominated individuals (of each population) to the adaptive grid. 

Second Stage. From this stage on, the algorithm uses a certain number of populations so 
that it can explore different regions of the search space. Thus, in the case of Pthreads, 
given a fixed number of threads, a dynamic distribution of the total number of populations 
takes place: each thread evolves the next available population. At each generation, 
each thread evolves its corresponding populations and then sends the nondominated 
individuals from each population to the adaptive grid (line 10, Figure 1). The grid access 
was implemented with mutual exclusion. After accessing the adaptive grid, the master 
thread is in charge of counting the number of individuals provided by each population 
to constitute the current Pareto front, and also in charge of reassigning the resources 
corresponding to each of the populations (lines 11 and 12, Figure 1). In the case of MPI, 
given a fixed number of slave processes, we assigned a fixed and equal number of 
populations to each process. Once the master process has decided which populations 
will be assigned to each slave process, it proceeds to transfer them. In order to reduce 
the number of point-to-point messages sent and received between processes, we created 
buffers. Thus, each time that one or more full populations need to be sent or received, 
a buffer is created to pack (or receive) all the necessary information and, later on, this 
information is sent (or unpacked). This is done with all the slaves, so that all of them can 
receive their corresponding populations. Finally, each slave sends back all its populations 
to the master process, so that the master can use them in any procedures required. 

Third and Fourth Stages. The main mechanism of these stages, represented by lines 
4-7 in Figure 1, is performed by the master thread (process). Then, we continue with the 
evolutionary process and with the resources reassignment described in the second stage. 

Synchronization. In the case of Pthreads, at each generation we must wait until all the 
threads have finished their corresponding evolutionary processes and grid accesses. Each 
thread that finishes waits until the continuation signal is received. Once all the threads 
have finished, the last thread to arrive takes care of the necessary processing and, when it 
finishes, it sends the required signal to awaken all the other threads and continue with 
the evolutionary process. In the case of MPI, we use barriers at each generation for the 
synchronization. Such barriers are adopted after sending all the populations to all the 
slaves and after the reception of all the corresponding populations. This was done so 
that all the slaves could start the evolutionary process and the corresponding evaluations of 
their individuals at the same time. 
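The per-generation synchronization described for the threaded version can be illustrated with a small Python sketch. It is not the Pthreads code used in the paper: Python's `threading.Barrier` and `threading.Lock` stand in for the condition-variable and mutex logic, and the names `run_generations`, `grid` and `end_of_generation` are hypothetical. The barrier's `action` callback plays the role of the work done by the last thread to arrive before the others are awakened.

```python
import threading


def run_generations(num_threads=4, generations=3):
    """Simulate threads that access a shared grid and meet at a barrier."""
    grid = []                      # stand-in for the adaptive grid
    grid_lock = threading.Lock()   # grid access under mutual exclusion
    grid_sizes = []                # grid size recorded once per generation

    def end_of_generation():
        # Runs in exactly one thread, after all threads have arrived and
        # before any of them is released (e.g., resource reassignment).
        grid_sizes.append(len(grid))

    barrier = threading.Barrier(num_threads, action=end_of_generation)

    def worker(thread_id):
        for generation in range(generations):
            candidate = (generation, thread_id)  # placeholder individual
            with grid_lock:
                grid.append(candidate)
            barrier.wait()  # wait for the continuation signal

    threads = [threading.Thread(target=worker, args=(t,))
               for t in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return grid_sizes
```

Because no thread can start the next generation until the barrier releases it, the grid size seen by the callback grows by exactly `num_threads` entries per generation.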
6 Comparison of Results 

The efficiency of a parallel algorithm tends to be measured in terms of its correctness and 
its speedup. The speedup (SP) of an algorithm is obtained by dividing the processing 
time of the serial algorithm (T_s) by the processing time of the parallel version (T_p): 
SP = T_s/T_p. In all the experiments performed, we used the following parameters for 
our approach: crossover rate (pc) of 0.8, mutation rate (pm) of 1/codesize and size of the 
initial population (popsize_init) equal to the minimum size of the secondary population 
(popsize_sec) = 20. The maximum number of generations (Gmax) was adjusted such 
that the algorithms always performed an average of 40000 fitness function evaluations 
per run. The maximum number of allowable populations was 200. The experiments took 
place on a PC with 4 processors. In order to give an idea of how good the performance 
of the proposed algorithm is, we will also include a comparison of results with respect to the 
Pareto Archived Evolution Strategy (PAES) [4], which is an algorithm representative of 
the state-of-the-art in the area (PAES was run using the same number of fitness function 
evaluations as the serial version of our coevolutionary algorithm). To allow a quantitative 
comparison of results, the following metrics were adopted: 

Error Ratio (ER): This metric was proposed by Van Veldhuizen [7] to indicate the 
percentage of solutions (from the nondominated vectors found so far) that are not members 
of the true Pareto optimal set: $ER = \frac{\sum_{i=1}^{n} e_i}{n}$, where n is the number of vectors in 
the current set of nondominated vectors available; $e_i = 0$ if vector i is a member of the 
Pareto optimal set, and $e_i = 1$ otherwise. 

Inverted Generational Distance (IGD): The concept of generational distance was 
introduced by Van Veldhuizen & Lamont [8,9] as a way of estimating how far the 
elements in the Pareto front produced by our algorithm are from those in the true Pareto front 
of the problem. This metric is defined as: $GD = \frac{\sqrt{\sum_{i=1}^{n} d_i^2}}{n}$, where n is the number 
of nondominated vectors found by the algorithm being analyzed and $d_i$ is the Euclidean 
distance (measured in objective space) between each of these and the nearest member of 
the true Pareto front. In our case, we implemented an “inverted” generational distance 
metric (IGD) in which we use as a reference the true Pareto front, and we compare each 
of its elements with respect to the front produced by an algorithm. 
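As an illustration of the two metrics, the following Python sketch computes ER and IGD for fronts given as lists of objective vectors. It is a minimal sketch under stated assumptions: membership in the true Pareto optimal set is tested against a sampled true front with a tolerance `tol` (a hypothetical parameter), and IGD is taken to use the same functional form as GD with the roles of the two fronts swapped.

```python
import math


def error_ratio(found, true_front, tol=1e-9):
    """Fraction of found vectors that are not members of the true front."""
    def is_member(vector):
        return any(math.dist(vector, t) <= tol for t in true_front)
    return sum(0 if is_member(v) else 1 for v in found) / len(found)


def generational_distance(front_a, front_b):
    """sqrt(sum of d_i^2) / n, where d_i is the distance from each vector
    of front_a to its nearest vector in front_b."""
    distances = [min(math.dist(a, b) for b in front_b) for a in front_a]
    return math.sqrt(sum(d * d for d in distances)) / len(front_a)


def inverted_generational_distance(found, true_front):
    """GD measured from the true front towards the front that was found."""
    return generational_distance(true_front, found)
```

With a true front of three points and a found front that hits two of them and adds one dominated point, ER is 1/3, while IGD penalizes the true-front point that was missed.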

For each of the examples shown below, we performed 30 runs per algorithm. The 
Pareto fronts that we will show correspond to the median of the 30 runs performed with 
respect to the ER metric. 



6.1 Test Function 1 

Table 1 shows the values of SP and the metrics ER and IGD for each of the versions 
compared. 



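$$\text{Min } f_1(x_1, x_2) = x_1, \qquad \text{Min } f_2(x_1, x_2) = (1.0 + 10.0 x_2) \left[ 1.0 - \left( \frac{x_1}{1.0 + 10.0 x_2} \right)^{2} - \frac{x_1}{1.0 + 10.0 x_2} \sin(8 \pi x_1) \right], \qquad 0.0 \le x_1, x_2 \le 1.0 \qquad (4)$$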
Table 1. Comparison of results for the first test function. SP refers to the speedup achieved. 

                  PAES       Serial     Pthreads     MPI 
                                        SP=2.9541    SP=2.5185 
ER   best         0.01       0.22       0.16         0.09 
     median       0.06       0.40       0.36         0.39 
     worst        0.12       0.57       0.57         0.56 
     average      0.057      0.39       0.37         0.36 
     std. dev.    0.0301     0.1164     0.1215       0.1283 
IGD  best         0.001030   0.000596   0.000606     0.000564 
     median       0.001305   0.000818   0.000807     0.000875 
     worst        0.003224   0.003277   0.003277     0.002906 
     average      0.001382   0.001062   0.001061     0.001204 
     std. dev.    0.000409   0.000638   0.000638     0.000637 



Fig. 2. Pareto fronts obtained by the serial versions of PAES and our coevolutionary algorithm, 
the Pthreads version and the MPI version for the first test function. 
Table 2. Comparison of results for the second test function. SP refers to the speedup achieved. 

                  PAES       Serial     Pthreads     MPI 
                                        SP=2.2486    SP=2.5639 
ER   best         0.01       0.06       0.03         0.07 
     median       0.11       0.17       0.16         0.16 
     worst        1.0        0.56       0.56         0.77 
     average      0.26       0.14       0.14         0.14 
     std. dev.    0.3390     0.1191     0.1197       0.1842 
IGD  best         0.005430   0.002321   0.002611     0.002882 
     median       0.009875   0.003642   0.003648     0.003444 
     worst        0.023626   0.007876   0.007876     0.010387 
     average      0.010495   0.003747   0.003791     0.003914 
     std. dev.    0.004310   0.001035   0.000984     0.001431 



[Four plot panels: PAES, Serial Version, Pthreads Version, MPI Version] 



Fig. 3. Pareto fronts obtained by the serial versions of PAES and our coevolutionary algorithm, 
the Pthreads version and the MPI version for the second test function. 



7 Discussion of Results 

In the case of the first test function, when comparing results with respect to the ER metric, 
PAES considerably improves the results achieved by our coevolutionary approach in all 
of its versions. However, note that with respect to the IGD metric, our approach presents 
a better average performance than PAES. On the other hand, it is very important to note 
that, despite the fact that the use of Pthreads improves the results of the serial version 
(with respect to the ER metric), the MPI version does even better. With respect to the 
IGD metric, the three versions of the coevolutionary algorithm have a similar behavior 
in all the test functions. In general, the results when using Pthreads are very similar to 
those obtained with the serial version. In contrast, the MPI version produces (marginally) 
better results than the other two versions in the case of the first function, and (marginally) 
poorer results in the case of the second function. 

In the second test function, our coevolutionary approach had a better average perfor- 
mance than PAES with respect to the ER metric. In fact, note that the worst solutions 
achieved by PAES totally missed the true Pareto front of the problem (hence the 
value of 1.0 obtained). Regarding the IGD metric, the results of the three versions of 
our coevolutionary algorithm are approximately three times better than those obtained 
by PAES. In the case of the second test function, the mere use of Pthreads improves on 
the results obtained using the serial version of the algorithm (when measured with re- 
spect to the ER metric). On average, however, the results obtained by our MPI approach 
are of the same quality as those obtained with the serial version of the algorithm. 

In general, regarding the ER metric, the four algorithms reach the true Pareto front 
of each problem, and the first test function is the only one in which PAES is found to be 
superior to our approach (in any of its versions). However, regarding the IGD metric, we 
can see that the results of our coevolutionary algorithm are better than those obtained by 
PAES. Graphically, we can see that this is due to the fact that PAES has trouble covering 
the entire Pareto front of the problem. Regarding the speedups achieved, in the first test 
function, the Pthreads implementation was superior to the MPI version. In the second 
example, the best speedup was achieved by our MPI strategy. The speedup values that 
we obtained are considered acceptable if we take into account that the parallelization 
strategy adopted in this study is rather simple and does not adopt the best possible 
workload for the 4 processors available. 
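To put the reported speedups in context, one common derived figure is the parallel efficiency, SP divided by the number of processors. This quantity is not reported in the paper; the short Python sketch below merely applies that standard definition to the SP values of Tables 1 and 2 and the 4 processors mentioned in Section 6. The function name `parallel_efficiency` is hypothetical.

```python
def parallel_efficiency(speedup, processors):
    """Parallel efficiency: achieved speedup relative to the ideal one."""
    return speedup / processors


# Speedups reported in Tables 1 and 2 (4 processors).
reported_speedups = {
    ("test function 1", "Pthreads"): 2.9541,
    ("test function 1", "MPI"): 2.5185,
    ("test function 2", "Pthreads"): 2.2486,
    ("test function 2", "MPI"): 2.5639,
}

for (problem, version), sp in reported_speedups.items():
    print(f"{problem:16s} {version:9s} efficiency = {parallel_efficiency(sp, 4):.4f}")
```

Under this definition the efficiencies lie roughly between 0.56 and 0.74, which is consistent with the remark that the workload distribution leaves room for improvement.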



8 Conclusions and Future Work 

We presented a simple parallelization of a coevolutionary multi-objective optimization 
algorithm. The main idea of our algorithm is to obtain information along the evolutionary 
process so as to subdivide the search space into subregions, and then to use a subpopulation 
for each of these subregions. At each generation, these different subpopulations 
“cooperate” and “compete” among themselves and from these different processes we 
obtain a single Pareto front. The size of each subpopulation is adjusted based on its 
contribution to the current Pareto front. Thus, those populations contributing more 
nondominated individuals have a higher reproduction probability. Three versions of the 
algorithm were compared in this paper: the serial version and two parallel versions: one 
using Pthreads and another one using MPI. We also included a comparison of results with 
respect to PAES to have an idea of the performance of the serial version of our algorithm. 
The results obtained indicate that, despite the simplicity of the parallel strategy that we 
implemented, the gains in execution time are considerable, without significantly affecting 
the quality of the results with respect to the serial version. As part of our 
future work we are considering the use of a more efficient parallelization strategy that 
can improve the speedup values achieved in this paper. We are also considering certain 
structural engineering applications for our proposed approach. 
Abstract. One of the objectives of distributed artificial intelligence is the 
decentralization of control. Multi-agent architectures distribute the control 
among two or more agents, which will be in charge of different events. In this 
paper the design of a parallel Multi-agent architecture for genetic algorithms is 
described, using a bottom-up behavior design and reactive agents. This design 
aims to improve the solutions obtained by parallel genetic algorithms. The 
purpose of incorporating a reactive behavior in the parallel genetic algorithms is 
to improve the overall performance by updating the sub-populations according 
to the general behavior of the algorithm, thus avoiding getting stuck in local minima. 
Two kinds of experiments were conducted for each one of the algorithms and 
the results obtained with both experiments are shown. 



1 Introduction 

One kind of Multi-agent (MA) architecture is the one designed with reactive agents, 
which were first studied by Brooks [1] in the early 80s. Their main characteristic is 
that they have no exhaustive representation of the real world. Their behavior is 
emergent, i.e., it depends to a great extent on the perception of the environment 
at a given instant. 

Reactive agents are also known as behavior-based agents. Their main features are: 
1) a constant interaction with the environment, and 2) a control mechanism allowing 
them to work with limited resources and incomplete information [3,4,6]. 

The chief advantage of the use of reactive agents lies in the speed of adaptation to 
unforeseen situations. This advantage has a high cost, since reactive design leaves a 
great deal of deliberative tasks to the designer. 

In this paper, a proposal for the analysis and design of reactive agents, based on the 
bottom-up design mechanism proposed by Maes [7] and Laureano, de Arriaga and 
Garcia-Alegre [5], is developed in order to improve on the results obtained in [10] by 
means of a parallel genetic algorithm whose domain is centered on the efficiency of 
the classification of a universal set into two independent subsets. The balance between 
the capacity to converge to an optimum and the ability to explore new regions is 
dictated by p_c, p_m and the type of crossover employed. In this work p_c and p_m are varied 
adaptively according to the fitness values of the solutions [9]. 

This architecture is implemented by establishing a similarity between the desires 
expressed by Maes (eating, exploring and fleeing) and the desires to interchange 
individuals and to improve the quality of the solution. 



2 Design of the Reactive Genetic Algorithm 

2.1 Control Mechanism for a Reactive Agent 

Pattie Maes [7] proposes a reactive mechanism for choosing a given behavior among 
a set of them. 

Maes simulates the behavior of a not very complex artificial creature, which has a set 
of ten different behaviors: eating, drinking, exploring, going-towards-food, going- 
towards-water, sleeping, fighting, fleeing-from-creature, going-towards-creature, and 
dodging-obstacle. In addition, it has a set of seven motivations: thirst, hunger, sleepiness, 
fear, tiredness, safety and aggressiveness. Such creatures are placed in a virtual 
environment, inside which some elements can be found, such as water, food, another 
creature or an obstacle. The quantities of those elements are variable. 

Each behavior is associated with: a) an activation level represented by a real number; 
b) a set of conditions, and c) a threshold, represented as well by a real number. For a 
given behavior to be executed, its activation level must surpass the threshold level, as 
well as fulfilling a number of conditions related to the environment. For instance: for 
the eating behavior to be executed, there must be food near the creature. 

As for behavior activation mechanisms, they are represented by 1) the creature’s 
desire to perform a given behavior; 2) the present situation: the environment 
influences the creature’s desires; 3) activation through the precedent: When a 
behavior is executed, the value of the activation level goes down to almost zero. The 
same happens to all the values of activation levels associated to the desires related to 
that behavior. 
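The selection mechanism described in this subsection can be sketched in Python as follows. This is a simplified illustration, not Maes's architecture itself: the class `Behavior`, the function `select_behavior`, the rule of picking the eligible behavior with the highest activation, and the residual factor of 0.01 (standing in for "almost zero") are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

RESIDUAL = 0.01  # assumed factor modeling "goes down to almost zero"


@dataclass
class Behavior:
    name: str
    threshold: float
    condition: Callable[[Dict[str, bool]], bool]  # environment precondition
    activation: float = 0.0

    def is_executable(self, environment: Dict[str, bool]) -> bool:
        # The activation level must surpass the threshold AND the
        # environmental conditions must hold.
        return self.activation > self.threshold and self.condition(environment)

    def execute(self) -> None:
        # After execution the activation level drops to almost zero.
        self.activation *= RESIDUAL


def select_behavior(behaviors: List[Behavior],
                    environment: Dict[str, bool]) -> Optional[Behavior]:
    """Execute and return the eligible behavior with the highest activation."""
    ready = [b for b in behaviors if b.is_executable(environment)]
    if not ready:
        return None
    chosen = max(ready, key=lambda b: b.activation)
    chosen.execute()
    return chosen
```

For example, an "eating" behavior with a high activation level is still not selected while no food is near the creature, which mirrors the example given in the text.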



2.2 Parallel Genetic Algorithm with Reactive Characteristics 

The performance of the Parallel Genetic Algorithm (PGA) is based on the interchange of 
individuals among different sub-populations. The best-known schemes for interchanging 
individuals are migration and diffusion. Those mechanisms require as input data 
previous knowledge of: 1) the generation, and 2) the sub-populations involved. 

The purpose of incorporating a reactive behavior in the PGA is to improve the overall 
performance by updating the sub-populations according to the general behavior of 
the algorithm, thus avoiding getting stuck in local minima. An emergent behavior is 
proposed which, as has already been stated, is the main feature of reactive agents. The 
Reactive Parallel Genetic Algorithm (RPGA) will be able to send individuals to, and 
receive individuals from, any other sub-population. It does so whenever required, by 
observing the stimuli presented by the environment at a given time, without affecting 
its performance. What follows is a more detailed description of the 
characteristics and elements that will substitute the traditional migration mechanism: 

1. A sub-population might send individuals to another sub-population in any 
generation. There is no need for both sub-populations to be synchronized. The 
sender does not know in what generation, nor to whom, it will send individuals 
until the very moment they are sent. 

2. Due to the reactive migration plan, several cases can arise in the individuals 
interchange stages. These go from the minimal case in which in a given 
generation no population sends any individuals, to the maximum one in which, 
given a generation, all the populations send individuals to the n - 1 remaining 
ones. All the subsets of ordered pairs of the set of sub-populations must be 
considered. 

3. In each generation, each genetic algorithm GA (sub-population) must verify whether 
individuals from other sub-populations have arrived. This is necessary because of the lack 
of synchronization and of a migration topology. For that reason, all the sub- 
populations, represented by their GA, must check a special buffer, which will be 
called the station. It is in this station where all arriving individuals are 
temporarily stored. If the station is empty, the GA continues with its task; if it is 
not, the individuals are extracted and added to the sub-population. 

4. The sending of individuals from one sub-population to another is determined by 
certain non-deterministic impulses assigned to each GA. Those impulses would 
be controlled by a mechanism similar to the one used by Maes [7] to simulate the 
behavior of simple creatures in a dynamic environment. The mechanism 
proposed here borrows some concepts from Maes, adding some new ones in 
order to be able to adapt it to the field of GAs. The new concepts are defined 
below. 

- Migration Limit: is a value representing the limit that the migration impulse 
must surpass for the GA to be able to send a group of individuals to some of 
the n - 1 remaining sub-populations. The migration limit can remain 
constant during the population evolution. 

- Migration Impulse: is a value representing the GA’s desire to send 
individuals to a specific sub-population. Each GA has a 
migration impulse for each one of the n - 1 remaining sub-populations. 
When this value surpasses the migration limit, the GA selects a group of 
individuals and sends them to the corresponding sub-population. The 
migration impulse is cut down to 10% of its present value after they are sent. 

- Stimulus Range: is an interval comprising the allowed values of the stimulus 
used to increment the migration impulse. In other words, all the GAs (sub- 
populations) in each generation produce a random value for each one of the 
n - 1 remaining sub-populations. This value lies within the stimulus range. 
The generated value will be called stimulus and it will be added 
to the present value of the migration impulse. 
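The three concepts defined above can be sketched as a small Python class. This is a minimal sketch, not the authors' implementation: the class name `MigrationController` and its method names are hypothetical, and the default limit of 20 and stimulus range of [0.0, 2.0] are the values reported later in Section 3.2.

```python
import random


class MigrationController:
    """Tracks one migration impulse per remaining sub-population."""

    def __init__(self, n_subpops, own_index, limit=20.0,
                 stimulus_range=(0.0, 2.0), rng=None):
        self.limit = limit
        self.stimulus_range = stimulus_range
        self.rng = rng or random.Random()
        # One impulse for each of the n - 1 remaining sub-populations.
        self.impulse = {j: 0.0 for j in range(n_subpops) if j != own_index}

    def step(self):
        """Advance one generation; return the sub-populations to send to."""
        targets = []
        for j in self.impulse:
            # Add a random stimulus drawn from the stimulus range.
            self.impulse[j] += self.rng.uniform(*self.stimulus_range)
            if self.impulse[j] > self.limit:
                targets.append(j)
                # After sending, the impulse is cut to 10% of its value.
                self.impulse[j] *= 0.10
        return targets
```

With these values no migration can occur during the first ten generations, since each stimulus adds at most 2.0; afterwards each target is reached at irregular, unsynchronized intervals.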

Given the characteristics described above, and knowing the optimal solution, the 
purpose is to keep a high diversity in the sub-populations in order to prevent premature 
convergence. 

An important aspect of the GA is the quality of the solution. For that reason, a 
mechanism is included in which each population requests individuals from the n - 1 
remaining sub-populations. Such a request will be made based on the time 
elapsed since the last time the best solution was improved. The time interval 
(number of generations) elapsed without improvement will be called the improvement 
limit. 

When the number of generations elapsed in a GA without improvement reaches the 
improvement limit, it sends a message to the other n - 1 GAs. This message is interpreted as a 
request for individuals by the sender. It allows the rest of the GAs 
(sub-populations) to send their own individuals. Once they are sent, the value of the 
migration impulse of the receiving sub-population is decreased. The value of the 
migration impulse is stored in each sending sub-population. That value is decreased 
in order to prevent the receiving sub-population from receiving individuals again within a short 
period of time. 

All these characteristics, together in the same GA, constitute what we hereafter will 
call Reactive Parallel Genetic Algorithm (RPGA), which is presented in Fig. 1. 

As Fig. 1 shows, the RPGA works in the following fashion: initially, 
it randomly generates a population of individuals, and enters the evolution cycle of 
that population, where for each iteration (generation) it evaluates the aptitude of the 
individuals. It chooses the individuals that will survive, according to the selection 
plan used [2]. Individuals are crossed and mutated. Then the reactivity stage starts, 
where, to begin with, the value of the migration impulse of each remaining sub- 
population is increased. A check is made to see whether requests for individuals have 
arrived, in which case the value of the migration impulse is increased again for each 
sub-population that sent a request. The value of each migration impulse is 
examined, and individuals are sent to the sub-populations whose migration impulses are 
greater than the migration limit. 

Later on, the station is checked for individuals that have just arrived; if there are any, 
those individuals are added to the sub-population; otherwise, the iterative process 
continues. The last step is to verify which one is the best element found in this 

Generate Initial Population 
Do While Population Evolution Criterion not Fulfilled 
    Find Aptitude 
    Select Individuals 
    Cross Individuals 
    Mutate Individuals 
    Increment Migration Impulse 
    If (Requests Arrived), Then 
        Increment Migration Impulse 
    End If 
    If (Impulse > Limit), Then 
        Send Individuals 
    End If 
    If (Individuals Arrived), Then 
        Add Individuals to Population 
    End If 
    If (Solution not Improved in Given Time), Then 
        Send Individuals Request 
    End If 
End Do 



Fig. 1. Reactive Genetic Algorithm 

generation. If that element is better than the global optimum, the latter is replaced 
by the new best element. Otherwise, a counter is incremented. If the value of that 
counter is greater than or equal to the improvement limit, then the GA sends a 
request for individuals to the other GAs. This cycle is repeated until the result complies 
with the convergence criterion or until the agreed number of iterations is reached. 
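The last two steps of the cycle (checking the station and tracking the improvement limit) can be sketched in Python as below. This is an illustrative sketch only: individuals are represented by their fitness values, the class `SubPopulation` and its methods are hypothetical names, incoming individuals replace the worst ones as stated in Section 3.2, and resetting the counter after a request is an assumption not stated in the text.

```python
from collections import deque


class SubPopulation:
    def __init__(self, individuals, improvement_limit=20):
        self.individuals = list(individuals)  # fitness values as stand-ins
        self.station = deque()                # buffer for arriving individuals
        self.best = max(self.individuals)     # best element found so far
        self.stalled = 0                      # generations without improvement
        self.improvement_limit = improvement_limit

    def absorb_arrivals(self):
        """Move every individual waiting in the station into the population."""
        while self.station:
            newcomer = self.station.popleft()
            worst = min(range(len(self.individuals)),
                        key=self.individuals.__getitem__)
            self.individuals[worst] = newcomer  # replace the worst individual

    def update_best(self):
        """Return True when a request for individuals should be sent."""
        current = max(self.individuals)
        if current > self.best:
            self.best = current
            self.stalled = 0
            return False
        self.stalled += 1
        if self.stalled >= self.improvement_limit:
            self.stalled = 0  # assumed: restart counting after a request
            return True
        return False
```

A sub-population that stagnates for `improvement_limit` generations asks the others for individuals; once they arrive in its station, they are absorbed at the next generation.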

Communication between sub-populations is represented by sending and receiving 
individuals. This is accomplished by perceiving the local environment through three 
independent agents in each sub-population. Such perception is meant to find stimuli 
allowing them to communicate with the n - 1 remaining sub-populations. The interaction between 
agents is local to the sub-population and its purpose is to reach its goals: request, 
sending and reception of individuals. Fig. 2 shows the types of communication, 
considering two agents; that is, two populations. 



3 Experiments 

3.1 Objective Function 

For this research we had available a set of 201 samples, belonging to a universal set, 
of which 49 belong to Group 1 and 152 to Group 2. The comparison between genetic 
algorithms was carried out by measuring the effectiveness of each one in 
classifying each sample accurately in its proper group. 
Fig. 2. Types of Communication using 2 agents 



An objective function was designed for this purpose, determining the number of 
successes for each individual in the population, in such way that the best individual 
will be the one having 201 successes, or 100% effectiveness, while the worst would 
be the one with 0 successes, or 0% effectiveness when classifying: 



$$f(x) = C \sum_{i=1}^{n} a_i + \sum_{j=1}^{m} b_j \qquad (1)$$

where: 

n is the number of samples from group 1. 
m is the number of samples from group 2. 

$a_i$ = 1 if the individual classifies sample i from group 1 correctly, and 0 otherwise. 

$b_j$ = 1 if the individual classifies sample j from group 2 correctly, and 0 otherwise. 

C is a constant, used to assign a greater weight to successes in samples 
from group 1. 
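The weighted count of successes in Eq. (1) can be sketched in Python as follows. The function name `objective`, the `classify` callback returning a group label (1 or 2), and the value of C used in the example are all hypothetical; the paper does not report the value of C in this section.

```python
def objective(classify, group1, group2, C=2.0):
    """Weighted number of correctly classified samples (cf. Eq. 1).

    classify: function mapping a sample to a predicted group (1 or 2);
    group1, group2: the samples that truly belong to each group;
    C: weight given to successes in group 1 (illustrative value).
    """
    a = sum(1 for sample in group1 if classify(sample) == 1)  # sum of a_i
    b = sum(1 for sample in group2 if classify(sample) == 2)  # sum of b_j
    return C * a + b
```

As a usage example, a toy classifier that assigns negative numbers to group 1 and the rest to group 2 gets two successes in each group for the samples `[-1, -2, 3]` and `[1, 2, -5]`, giving 2.0 * 2 + 2 = 6.0.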



In this work two types of mutation were used: static and dynamic. The dynamic one 
was defined in a similar way to the one used by Pal and Wang [10], where individuals 
whose fitness is equal to the best in the generation have p_m = 0. This avoids the mutation 
of high-quality individuals, but results in a high probability of premature 
convergence. Therefore, in this case a more aggressive mutation rate can be used, that 
is, 
= K for A , <= (3) 

where: 

p_m = mutation probability. 

A_max = aptitude of the best individual of the population in that 
generation. 

A_avg = average aptitude of all individuals of the population in the 
generation. 

A_i = aptitude of the individual for which p_m is being estimated, 
k,'” = 0.5 

k, = 0.25 



3.2 Parameters and Configuration 

The RPGA uses a maximum of 2000 generations. The stochastic universal sampling 
selection method is applied. The crossover mechanism is a discrete one with a 
probability of 1.0. The mutation mechanism is based on the one proposed by 
Mühlenbein and Schlierkamp-Voosen [8] and the mutation probability is dynamic, 
based on the work of Pal and Wang [9]. 

A real codification is used for the chromosome structure, since the alleles of each 
individual have values in the interval [0, 1]. An adjusted aptitude plan is used to map 
the aptitudes of the individuals to values in the interval [0, 1]. 



Table 1. Results of tests for parallel and reactive genetic algorithms when using static 
mutation. 

           Parallel Genetic Algorithm    |  Reactive Genetic Algorithm 
Run        Time     Aptitude   Effect    |  Time     Aptitude   Effect 
1          370      0.9813     93.58     |  409      0.9798     93.08 
2          372      0.9829     94.06     |  407      0.9829     94.06 
3          368      0.9844     94.56     |  410      0.9829     94.06 
4          369      0.9829     94.06     |  412      0.9813     93.58 
5          375      0.9782     92.58     |  413      0.9798     93.08 
6          375      0.9813     93.58     |  411      0.9844     94.56 
7          389      0.9798     93.08     |  408      0.9813     93.58 
8          371      0.9782     92.58     |  407      0.9844     94.56 
9          372      0.9798     93.08     |  410      0.9844     94.56 
10         371      0.9844     94.56     |  415      0.9829     94.06 
Average    373.2    0.9813     93.57     |  410.2    0.9824     93.92 
Table 2. Results of tests for parallel and reactive genetic algorithms when using dynamic 
mutation. 

           Parallel Genetic Algorithm    |  Reactive Genetic Algorithm 
Run        Time     Aptitude   Effect    |  Time     Aptitude   Effect 
1          670      0.9829     94.06     |  707      0.9860     95.06 
2          679      0.9875     95.55     |  710      0.9891     96.04 
3          673      0.9860     95.06     |  711      0.9829     94.06 
4          672      0.9813     93.58     |  709      0.9844     94.56 
5          670      0.9860     95.06     |  710      0.9891     96.04 
6          669      0.9860     95.06     |  708      0.9813     93.58 
7          671      0.9829     94.06     |  700      0.9891     96.04 
8          675      0.9813     93.58     |  710      0.9860     95.06 
9          679      0.9829     94.06     |  712      0.9875     95.55 
10         673      0.9829     95.55     |  715      0.9891     96.04 
Average    673.1    0.9840     94.56     |  709.2    0.9864     95.20 



The RPGA has a population of 800 individuals available; the best results were 
obtained when it was divided into 8 sub-populations of 100 each. Members of the 
population migrate using the reactive mechanism with a migration percentage of 10%. 
The best individuals of each population are selected to migrate to others, and the 
incoming individuals substitute the worst ones. 

In order to implement a reactive migration scheme, several runs were conducted. The 
best results were obtained for migration rates of 10, 15 and 20; to reduce 
communication, we chose 20 as the migration limit. Migration impulses are 
initialized to zero and progressively increased in each generation by a random stimulus 
value in the interval [0.0, 2.0]. The improvement limit has a value of 20 
generations. 



3.3 Experiments and Results 

Two kinds of experiments were performed for each one of the algorithms, a) the 
parallel one and b) the reactive one. The first experiment was made using static mutation, 
and the second one, dynamic mutation. For each algorithm 10 trials were made. At the end of 
those experiments, information was obtained about: 1) the best percentage of 
classification, and 2) the classification average. The results obtained with both 
experiments are shown above (Table 1 and Table 2). 

As can be seen in Table 1, the reactive genetic algorithm obtained a classification 
average slightly higher than the parallel one. Even so, both algorithms achieved a best 
classification percentage of 94.56%. 


















A.L. Laureano-Cruces et al. 



Table 2 shows that the reactive genetic algorithm obtained a classification average of 
95.20%, while that of the parallel one is slightly lower, at 94.56%. On the other hand, the 
reactive algorithm achieved a best classification effectiveness of 96.04%, while the 
parallel one reached 95.55%. 



4 Conclusions 

In this paper, a migration mechanism for genetic algorithms controlled by reactive 
agents is presented. Such agents were inspired by the bottom-up design of the 
behavior proposed by Maes [7] and Laureano et al. [5]. 

The results obtained show that both algorithms achieve their best results when
dynamic mutation is used, the RGA being the best for classifying, with a mean
effectiveness percentage of 95.20%, while the PGA reaches 94.56%.

The RGA can thus be considered an efficient method for the solution of complex
problems. In the case studied here, its results are slightly better than those of the
PGA. Although the computing time of the RGA is larger, it has a practical advantage:
a higher classification average.
Abstract. In this paper, we propose a differential evolution algorithm to solve
constrained optimization problems. Our approach uses three simple selection criteria
based on feasibility to guide the search to the feasible region. The proposed
approach does not require any extra parameters other than those normally
adopted by the Differential Evolution algorithm. The present approach was validated
using test functions from a well-known benchmark commonly adopted to
validate constraint-handling techniques used with evolutionary algorithms. The
results obtained by the proposed approach are very competitive with respect to
other constraint-handling techniques that are representative of the state-of-the-art
in the area.



1 Introduction 

Evolutionary Algorithms (EAs) are heuristics that have been successfully applied in 
a wide set of areas [1,2], both in single and in multiobjective optimization. However, 
EAs lack a mechanism able to bias efficiently the search towards the feasible region 
in constrained search spaces. This has triggered a considerable amount of research and 
a wide variety of approaches have been suggested in the last few years to incorporate 
constraints into the fitness function of an evolutionary algorithm [3,4]. 

The most common approach adopted to deal with constrained search spaces is the use
of penalty functions. When using a penalty function, the amount of constraint violation is
used to punish or “penalize” an infeasible solution so that feasible solutions are favored
by the selection process. Despite their popularity, penalty functions have several
drawbacks, the main one being that they require careful fine-tuning of the penalty
factors, which determine the degree of penalization to be applied in order to approach
the feasible region efficiently [5,3].
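
To make the tuning problem concrete, here is a minimal sketch of a static quadratic penalty for a minimization problem. The helper name `make_penalized`, the quadratic form of the penalty and the toy problem are illustrative assumptions; they are not taken from the paper or from any specific penalty method it cites.

```python
def make_penalized(f, inequality_constraints, penalty_factor):
    """Return phi(x) = f(x) + r * sum(max(0, g_i(x))**2), to be minimized.

    Each constraint g_i is assumed to be written as g_i(x) <= 0.
    """
    def phi(x):
        violation = sum(max(0.0, g(x)) ** 2 for g in inequality_constraints)
        return f(x) + penalty_factor * violation
    return phi


# Hypothetical toy problem: minimize x^2 subject to 1 - x <= 0 (optimum at x = 1).
objective = lambda x: x * x
constraints = [lambda x: 1.0 - x]

weak = make_penalized(objective, constraints, penalty_factor=0.1)
strong = make_penalized(objective, constraints, penalty_factor=1000.0)
```

With the small factor, the infeasible point x = 0 scores better than the constrained optimum x = 1 (0.1 versus 1.0), so the search is drawn away from the feasible region; with the large factor the ordering is reversed. Choosing a factor between such extremes is the tuning burden referred to above.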



R. Monroy et al. (Eds.): MICAI 2004, LNAI 2972, pp. 707-716, 2004. 
© Springer-Verlag Berlin Heidelberg 2004







E. Mezura-Montes, C.A. Coello Coello, and E.I. Tun-Morales 



Differential Evolution (DE) is a relatively new EA proposed by Price and Storn [6]. 
The algorithm is based on the use of a special crossover-mutation operator, based on the 
linear combination of three different individuals and one subject-to-replacement parent. 
The selection process is performed via deterministic tournament selection between the 
parent and the child created from it. However, like any other EA, DE lacks a mechanism to
deal with constrained search spaces.

The constraint-handling approach proposed in this paper relies on three simple selection
criteria based on feasibility to bias the search towards the feasible region. We
have previously implemented the same approach on different types of Evolution Strategies,
with very promising results [7,8]. The main motivation of this work was
to analyze whether the selection criteria that we successfully adopted in evolution
strategies would also work with differential evolution. This is an important issue to us,
because it has been hypothesized in the past that evolution strategies are a very powerful
search engine for constrained optimization when dealing with real numbers [9].
However, no such studies exist for differential evolution or for other related heuristics
that operate on real numbers (as evolution strategies do). We thus believe that the search
power of other heuristics such as differential evolution has been underestimated, hence
our interest in analyzing it.

The paper is organized as follows: In Section 2, the problem of interest is stated.
In Section 3 we describe previous work related to the current algorithm. A detailed
description of our approach is provided in Section 4. The experiments performed and
the results obtained are shown in Section 5, and we discuss them in Section 6. Finally,
in Section 7 we draw some conclusions and outline our future paths of research.

2 Statement of the Problem 

We are interested in the general nonlinear programming problem in which we want
to: Find $x$ which optimizes $f(x)$ subject to: $g_i(x) \leq 0$, $i = 1, \ldots, n$;
$h_j(x) = 0$, $j = 1, \ldots, p$, where $x$ is the vector of solutions
$x = [x_1, x_2, \ldots, x_r]^T$, $n$ is the number of inequality constraints and $p$ is
the number of equality constraints (in both cases, constraints could be linear or
nonlinear). If we denote the feasible region by $F$ and the whole search space by $S$,
then it should be clear that $F \subseteq S$. For an inequality constraint that satisfies
$g_i(x) = 0$, we will say that it is active at $x$. All equality constraints $h_j$
(regardless of the value of $x$ used) are considered active at all points of $F$.
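
The definitions of feasibility and of active constraints can be illustrated with a short sketch. The function names and the numerical tolerance `tol` are assumptions introduced here for floating-point comparisons; the problem statement itself uses exact (in)equalities.

```python
def is_feasible(x, ineqs, eqs, tol=1e-9):
    """True if x satisfies every g_i(x) <= 0 and every h_j(x) = 0 (within tol)."""
    return (all(g(x) <= tol for g in ineqs)
            and all(abs(h(x)) <= tol for h in eqs))


def active_inequalities(x, ineqs, tol=1e-9):
    """Indices i of the inequality constraints with g_i(x) = 0 (within tol)."""
    return [i for i, g in enumerate(ineqs) if abs(g(x)) <= tol]


# Hypothetical two-variable example.
ineqs = [lambda x: x[0] + x[1] - 2.0,   # g_0(x) = x_0 + x_1 - 2 <= 0
         lambda x: -x[0]]               # g_1(x) = -x_0 <= 0
eqs = [lambda x: x[0] - x[1]]           # h_0(x) = x_0 - x_1 = 0
```

At the point (1, 1) the first inequality is active and the point is feasible; at (2, 2) the first inequality is violated, so the point lies in $S$ but outside $F$.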

3 Previous Work 

DE is a population-based evolutionary algorithm with a special recombination operator
that performs a linear combination of a number of individuals (normally three) and one
parent (which is subject to replacement) to create one child. The selection is deterministic
between the parent and the child: the better of the two remains in the next population. DE
shares similarities with traditional EAs. However, it does not use binary encoding as
a simple genetic algorithm does [2], and it does not use a probability density function to
self-adapt its parameters as an Evolution Strategy does [10]. The main differential
evolution algorithm [6] is presented in Figure 1.
Begin
  G=0
  Create a random initial population $x^i_G$ $\forall i$, $i = 1, \ldots, NP$
  Evaluate $f(x^i_G)$ $\forall i$, $i = 1, \ldots, NP$
  For G=1 to MAX_GENERATIONS Do
    For i=1 to NP Do
      Select randomly $r_1 \neq r_2 \neq r_3$
      jrand = randint(1, D)
      For j=1 to D Do
        If (rand_j[0,1) < CR or j = jrand) Then
          $u^i_{j,G+1} = x^{r_3}_{j,G} + F(x^{r_1}_{j,G} - x^{r_2}_{j,G})$
        Else
          $u^i_{j,G+1} = x^i_{j,G}$
        End If
      End For
      If ($f(u^i_{G+1}) \leq f(x^i_G)$) Then
        $x^i_{G+1} = u^i_{G+1}$
      Else
        $x^i_{G+1} = x^i_G$
      End If
    End For
    G = G + 1
  End For
End



Fig. 1. DE algorithm. randint(min,max) is a function that returns an integer number between min
and max. rand[0, 1) is a function that returns a real number between 0 and 1. Both are based on a
uniform probability distribution. “NP”, “MAX_GENERATIONS”, “CR” and “F” are user-defined
parameters.
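
The loop of Fig. 1 can be sketched in Python as follows. This is an illustrative reading of the figure for an unconstrained minimization problem, not the authors' code: the function name, the `bounds` argument used only to initialize the population, the default parameter values and the `seed` argument are assumptions. As in the figure, the three indices are only required to differ from each other.

```python
import random


def differential_evolution(f, bounds, NP=30, MAX_GENERATIONS=200,
                           F=0.6, CR=0.9, seed=None):
    """Minimize f with a DE/rand/1/bin scheme; returns (best_x, best_cost)."""
    rng = random.Random(seed)
    D = len(bounds)
    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(NP)]
    cost = [f(x) for x in pop]
    for _ in range(MAX_GENERATIONS):
        new_pop, new_cost = [], []
        for i in range(NP):
            r1, r2, r3 = rng.sample(range(NP), 3)  # three distinct indices
            jrand = rng.randrange(D)               # forces at least one mutated gene
            u = [pop[r3][j] + F * (pop[r1][j] - pop[r2][j])
                 if (rng.random() < CR or j == jrand) else pop[i][j]
                 for j in range(D)]
            fu = f(u)
            # Deterministic selection between the parent and its child.
            if fu <= cost[i]:
                new_pop.append(u)
                new_cost.append(fu)
            else:
                new_pop.append(pop[i])
                new_cost.append(cost[i])
        pop, cost = new_pop, new_cost
    best = min(range(NP), key=cost.__getitem__)
    return pop[best], cost[best]
```

For example, `differential_evolution(lambda x: sum(v * v for v in x), [(-5.0, 5.0)] * 3, seed=1)` minimizes a three-variable sphere function. Note that this sketch performs no bound or constraint handling, which is exactly the gap the remainder of the paper addresses.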



The use of tournament selection based on feasibility rules has been explored by
other authors. Jimenez and Verdegay [11] proposed an approach similar to a min-max
formulation used in multiobjective optimization combined with tournament selection.
The rules used by them are similar to those adopted in this work. However, Jimenez and
Verdegay’s approach lacks an explicit mechanism to avoid the premature convergence
produced by the random sampling of the feasible region, because their approach is guided
by the first feasible solution found. Deb [12] used the same tournament rules
in his approach. However, Deb proposed to use niching as a diversity mechanism,
which introduces some extra computational time (niching has time complexity
$O(N^2)$). In Deb’s approach, feasible solutions are always considered better than
infeasible ones. This contradicts the idea of allowing infeasible individuals to remain in
the population. Therefore, this approach will have difficulties in problems in which the
global optimum lies on the boundary between the feasible and the infeasible regions.
Coello & Mezura [13] used tournament selection based on feasibility rules. They also
adopted nondominance checking using a sample of the population (as in the multiobjective
optimization approach called NPGA [14]). They adopted a user-defined parameter $S_r$ to
control the diversity in the population. This approach provided good results in some
well-known engineering problems and in some benchmark problems, but had difficulties
when facing high dimensionality [13].

Some previous approaches have been proposed to solve constrained optimization
problems using DE. Storn [15] proposed an adaptive mechanism that relaxes the constraints
of the problem in order to make all the initial solutions feasible. This pseudo-feasible
region is shrunk each generation until it matches the real feasible region. Also,
Storn [15] proposed to use an aging concept in order to prevent a solution from remaining
in the population for too many generations. Furthermore, he modified the original DE
algorithm so that, when a child is created and it is not better than the parent subject to
replacement, another child is created. The process is repeated NT times. If the parent is
still better, the parent remains in the population. Both the aging parameter and NT are
defined by the user. Storn [15] used a modified “DE/rand/1/bin” version. The approach
showed a good performance in problems with only inequality constraints but presented
problems when dealing with equality constraints. Moreover, only two test functions (out of
seven used to test the approach) are included in the well-known benchmark for constrained
optimization proposed by Koziel & Michalewicz [16] and enriched by Runarsson &
Yao [9]. The main drawback of the approach is that it adds two user-defined parameters
and that the NT parameter can cause an increase in the number of evaluations of the
objective function without any user control.

Lampinen & Zelinka [17] used DE to solve engineering design problems. They opted
to handle constraints using a static penalty function approach that they called
“Soft-constraint”. The authors tested their approach using three well-known engineering
design problems [17]. They compared their results with respect to several classical
techniques and with respect to some heuristic methods. The main drawback of the approach
is the careful tuning required for the penalty factors, which is in fact mentioned by the
authors in their article. The last two methods discussed also lack a mechanism to maintain
diversity (to have both feasible and infeasible solutions in the population during the
whole evolutionary process), which is one of the most important aspects to consider when
designing a competitive constraint-handling approach [8].

4 Our Approach 

The design of our approach is based on the idea of preserving the main DE algorithm and 
just adding a simple mechanism, which has been found to be successful with other EAs. 
Moreover, our constraint-handling approach does not add any extra parameter defined 
by the user (other than those required by the original DE algorithm). 

The modifications made to the original DE are the following: 

1. The simple mechanism to deal with constraints consists of three simple selection
criteria which guide the algorithm to the feasible region of the search space:

- Between 2 feasible solutions, the one with the highest fitness value wins.

- If one solution is feasible and the other one is infeasible, the feasible solution
wins.

- If both solutions are infeasible, the one with the lowest sum of constraint violation
is preferred.

These criteria are applied when the child is compared against the parent subject to
replacement.

2. In order to accelerate the convergence process, when a child replaces its parent, it
is copied into the new generation but it is also copied into the current generation.
The goal of this change is to allow the new child, which is a new and better solution,
to be selected among the three solutions ($r_1$, $r_2$ or $r_3$) and contribute to creating
better solutions. In this way, a promising solution does not need to wait for the next
generation to share its genetic code.

3. When a new decision variable of the child is created and it falls outside the established
limits (lower and upper) by some amount, this amount is subtracted from or added
to the violated limit to shift the value inside the limits. If the shifted value now
violates the other limit (which may occur), as a last option, a random value inside
the limits is generated.
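
The selection criteria of item 1 and the limit handling of item 3 can be sketched as two small functions. This is a hedged illustration: the names `child_wins` and `repair` are hypothetical, minimization is assumed (so "highest fitness" means lowest objective value), and letting the child win ties follows the `≤` of Fig. 1 rather than anything stated explicitly in the list.

```python
import random


def child_wins(child, parent):
    """Apply the three criteria; both arguments are (objective, violation_sum)."""
    (f_child, v_child), (f_parent, v_parent) = child, parent
    if v_child == 0 and v_parent == 0:      # both feasible: better objective wins
        return f_child <= f_parent
    if v_child == 0 or v_parent == 0:       # exactly one feasible: it wins
        return v_child == 0
    return v_child <= v_parent              # both infeasible: lower violation wins


def repair(value, lower, upper, rng=random):
    """Shift an out-of-range variable back inside [lower, upper]."""
    if lower <= value <= upper:
        return value
    # Subtract (or add) the amount by which the violated limit was exceeded.
    shifted = 2 * lower - value if value < lower else 2 * upper - value
    if lower <= shifted <= upper:
        return shifted
    return rng.uniform(lower, upper)        # last option: random value in range
```

For instance, with limits [0, 1] a value of 1.5 exceeds the upper limit by 0.5 and is shifted to 0.5, whereas a value of 5.0 would be shifted to -3.0, which violates the lower limit, so a random value in [0, 1] is drawn instead.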

Our proposed version of the DE algorithm, called CHDE (Constraint Handling Differential
Evolution), is shown in Figure 2.



Begin
   G=0
   Create a random initial population $x^i_G$ $\forall i$, $i = 1, \ldots, NP$
   Evaluate $f(x^i_G)$ $\forall i$, $i = 1, \ldots, NP$
   For G=1 to MAX_GENERATIONS Do
     For i=1 to NP Do
       Select randomly $r_1 \neq r_2 \neq r_3$
       jrand = randint(1, D)
       For j=1 to D Do
         If (rand_j[0,1) < CR or j = jrand) Then
           $u^i_{j,G+1} = x^{r_3}_{j,G} + F(x^{r_1}_{j,G} - x^{r_2}_{j,G})$
         Else
           $u^i_{j,G+1} = x^i_{j,G}$
         End If
       End For
->     If ($u^i_{G+1}$ is better than $x^i_G$ (based on the three selection criteria)) Then
         $x^i_{G+1} = u^i_{G+1}$
->       $x^i_G = u^i_{G+1}$
       Else
         $x^i_{G+1} = x^i_G$
       End If
     End For
     G = G + 1
   End For
End



Fig. 2. CHDE algorithm. The modified steps are marked with an arrow. randint(min,max) is a 
function that returns an integer number between min and max. rand[0, 1) is a function that returns 
a real number between 0 and 1. Both are based on a uniform probability distribution. “NP”, 
“MAX_GENERATIONS”, “CR” and “F” are user-defined parameters 
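
Putting the pieces together, one possible reading of Fig. 2 is sketched below for minimization with inequality constraints written as g(x) <= 0. The function name `chde`, the default parameter values, the `seed` argument and the return format are assumptions. Because a winning child is copied into both the current and the next generation, the sketch simply updates one population in place.

```python
import random


def chde(f, ineqs, bounds, NP=60, MAX_GENERATIONS=300, F=0.6, CR=0.9, seed=None):
    """Sketch of CHDE; returns (best_x, objective, violation_sum)."""
    rng = random.Random(seed)
    D = len(bounds)

    def evaluate(x):
        # Pair of (objective value, sum of constraint violation).
        return f(x), sum(max(0.0, g(x)) for g in ineqs)

    def child_wins(child, parent):
        (fc, vc), (fp, vp) = child, parent
        if vc == 0 and vp == 0:
            return fc <= fp                 # both feasible
        if vc == 0 or vp == 0:
            return vc == 0                  # exactly one feasible
        return vc <= vp                     # both infeasible

    def repair(value, lo, hi):
        if lo <= value <= hi:
            return value
        shifted = 2 * lo - value if value < lo else 2 * hi - value
        return shifted if lo <= shifted <= hi else rng.uniform(lo, hi)

    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(NP)]
    score = [evaluate(x) for x in pop]
    for _ in range(MAX_GENERATIONS):
        for i in range(NP):
            r1, r2, r3 = rng.sample(range(NP), 3)
            jrand = rng.randrange(D)
            u = list(pop[i])
            for j in range(D):
                if rng.random() < CR or j == jrand:
                    mutant = pop[r3][j] + F * (pop[r1][j] - pop[r2][j])
                    u[j] = repair(mutant, *bounds[j])
            su = evaluate(u)
            if child_wins(su, score[i]):
                # In-place update: the child is immediately available as r1, r2 or r3.
                pop[i], score[i] = u, su
    best = min(range(NP), key=lambda i: (score[i][1], score[i][0]))
    return pop[best], score[best][0], score[best][1]
```

As a hypothetical usage example, minimizing (x_0 - 2)^2 + (x_1 - 2)^2 subject to x_0 + x_1 - 2 <= 0 on [-5, 5]^2 has its optimum on the constraint boundary at (1, 1), with objective value 2.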



5 Experiments and Results 

To evaluate the performance of the proposed approach we used the 13 test functions 
described in [9]. The test functions chosen contain characteristics that are representative 
of what can be considered “difficult” global optimization problems for an evolutionary 
algorithm. Their expressions can be found in [9]. 
To get a measure of the difficulty of solving each of these problems, a $\rho$ metric
(as suggested by Koziel and Michalewicz [16]) was computed using the following
expression: $\rho = |F|/|S|$, where $|F|$ is the number of feasible solutions and $|S|$ is the
total number of solutions randomly generated. In this work, $|S|$ = 1,000,000 random
solutions.
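
The $\rho$ metric is a plain Monte Carlo estimate and can be sketched as follows. The function name, the uniform sampling inside box bounds and the `seed` argument are assumptions; the text only states that the solutions are randomly generated.

```python
import random


def estimate_rho(ineqs, bounds, samples=1_000_000, seed=None):
    """Estimate rho = |F| / |S| by uniform random sampling of the search space."""
    rng = random.Random(seed)
    feasible = 0
    for _ in range(samples):
        x = [rng.uniform(lo, hi) for lo, hi in bounds]
        if all(g(x) <= 0 for g in ineqs):
            feasible += 1
    return feasible / samples
```

As a sanity check on a hypothetical problem, the unit disc x_0^2 + x_1^2 - 1 <= 0 inside the square [-1, 1]^2 should give a value close to pi/4, i.e. about 78.5%.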



Table 1. Values of $\rho$ for the 13 test problems chosen.

Problem  n    Type of function  $\rho$     LI   NI    LE   NE
g01      13   quadratic          0.0003%   9    0     0    0
g02      20   nonlinear         99.9973%   2    0     0    0
g03      10   nonlinear          0.0026%   0    0     0    1
g04      5    quadratic         27.0079%   4    2     0    0
g05      4    nonlinear          0.0000%   2    0     0    3
g06      2    nonlinear          0.0057%   0    2     0    0
g07      10   quadratic          0.0000%   3    5     0    0
g08      2    nonlinear          0.8581%   0    2     0    0
g09      7    nonlinear          0.5199%   0    4     0    0
g10      8    linear             0.0020%   6    0     0    0
g11      2    quadratic          0.0973%   0    0     0    1
g12      3    quadratic          4.7697%   0    9^3   0    0
g13      5    nonlinear          0.0000%   0    0     1    2


The different values of $\rho$ for each of the functions chosen are shown in Table 1, where
n is the number of decision variables, LI is the number of linear inequalities, NI the
number of nonlinear inequalities, LE is the number of linear equalities and NE is the
number of nonlinear equalities.

We performed 30 independent runs for each test function. Equality constraints were
transformed into inequalities using a tolerance value of 0.0001 (except for problems g03,
g11 and g13, where the tolerance was 0.001). The parameters used for the CHDE are
the following: NP = 60, MAX_GENERATIONS = 5,800. To ensure that there is
no sensitivity to the “F” and “CR” parameters, F was generated randomly (using a uniform
distribution) per run in [0.3, 0.9] and CR was also randomly generated in
[0.8, 1.0]. The intervals for both parameters were defined empirically.
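
The two settings described above can be sketched as follows. The helper names are hypothetical, and the specific form |h(x)| - eps <= 0 is the usual way of writing such a relaxation; the text itself only says that equalities were transformed into inequalities using a tolerance.

```python
import random


def relax_equality(h, eps=0.0001):
    """Turn h(x) = 0 into the inequality |h(x)| - eps <= 0."""
    return lambda x: abs(h(x)) - eps


def draw_parameters(rng):
    """Draw F and CR once per run from the empirically chosen intervals."""
    F = rng.uniform(0.3, 0.9)
    CR = rng.uniform(0.8, 1.0)
    return F, CR


# Hypothetical equality constraint x_0 - 1 = 0, relaxed with the default tolerance.
g = relax_equality(lambda x: x[0] - 1.0)
```

A point such as x_0 = 1.00005 then counts as feasible (g is negative), while x_0 = 1.001 does not. This relaxation is also why a reported "best" value can appear slightly better than the known optimum on problems with equality constraints.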

The results obtained with the CHDE are presented in Table 2. Comparisons of the
performance of CHDE with respect to three techniques that are representative of the
state-of-the-art in the area, namely the Homomorphous Maps [16], Stochastic Ranking [9]
and the Adaptive Segregational Constraint Handling Evolutionary Algorithm (ASCHEA)
[18], are presented in Tables 3, 4 and 5, respectively.

6 Discussion of Results 

As can be seen in Table 2, CHDE could reach the global optimum in the 13 test problems.
The apparent improvement over the optimum solutions (or the best-known solutions) for
problems g03, g05, g11 and g13 is due to the tolerance value adopted for the equality
constraints. However, the statistical measures suggest that the proposed approach
presents premature convergence in some cases. This seems to be caused by the high
selection pressure provided by the deterministic selection, which also prevents infeasible
solutions close to the boundaries of the feasible region from remaining in the population.
Therefore, our CHDE requires a diversity mechanism (i.e., some infeasible solutions
must remain in the population to avoid premature convergence) that does not increase
its computational cost in a significant way.

Table 2. Statistical results obtained by the CHDE for the 13 test functions with 30 independent
runs.

                      Statistical Results of the CHDE Algorithm
Problem  Optimal      Best          Mean            Median        Worst           St. Dev.
g01      -15          -15.000       -14.792134      -15.000       -12.743044      0.401
g02      0.803619     0.803619      0.746236        0.800445      0.302179        0.081
g03      1            1.00          0.640326        0.702939      0.029601        0.239
g04      -30665.539   -30665.539    -30592.154435   -30665.539    -29986.214382   108.779
g05      5126.498     5126.496714   5218.729114     5231.557639   5502.410392     76.422
g06      -6961.814    -6961.814     -6367.575424    -6961.814     -2236.950336    770.803
g07      24.306       24.306        104.599221      24.482980     1120.541494     176.761
g08      0.095825     0.095825      0.091292        0.095825      0.027188        0.012
g09      680.63       680.6300      692.472322      680.639178    839.782911      23.575
g10      7049.25      7049.248021   8442.656946     7137.415303   15580.370333    2186.49
g11      0.75         0.749         0.761823        0.749         0.870984        0.020
g12      1            1             1               1             1               0
g13      0.053950     0.053866      0.747227        0.980831      2.259875        0.313

Table 3. Comparison of our approach (CHDE) with respect to the Homomorphous Maps (HM).
NA = Not Available.

                      Best Result                Mean Result                  Worst Result
Problem  Optimal      CHDE          HM           CHDE            HM           CHDE            HM
g01      -15          -15.000       -14.7886     -14.792134      -14.7082     -12.743044      -14.6154
g02      0.803619     0.803619      0.79953      0.746236        0.79671      0.302179        0.79119
g03      1            1.00          0.9997       0.640326        0.9989       0.029601        0.9978
g04      -30665.539   -30665.539    -30664.5     -30592.154435   -30655.3     -29986.214382   -30645.9
g05      5126.498     5126.496714   -            5218.729114     -            5502.410392     -
g06      -6961.814    -6961.814     -6952.1      -6367.575424    -6342.6      -2236.950336    -5473.9
g07      24.306       24.306        24.620       104.599221      24.826       1120.541494     25.069
g08      0.095825     0.095825      0.0958250    0.091292        0.0891568    0.027188        0.0291438
g09      680.63       680.6300      680.91       692.472322      681.16       839.782911      683.18
g10      7049.25      7049.248021   7147.9       8442.656946     8163.6       15580.370333    9659.3
g11      0.75         0.749         0.75         0.761823        0.75         0.870984        0.75
g12      1            1             0.999999857  1               0.999134613  1               0.991950498
g13      0.053950     0.053866      NA           0.747227        NA           2.259875        NA

Table 4. Comparison of our approach (CHDE) with respect to the Stochastic Ranking (SR).
Entries for SR that are illegible in the source are left blank.

                      Best Result                Mean Result                  Worst Result
Problem  Optimal      CHDE          SR           CHDE            SR           CHDE            SR
g01      -15          -15.000       -15.000      -14.792134      -15.000      -12.743044      -15.000
g02      0.803619     0.803619      0.803515     0.746236        0.781975     0.302179
g03      1            1.00          1.000        0.640326        1.000        0.029601        1.000
g04      -30665.539   -30665.539    -30665.539   -30592.154435   -30665.539   -29986.214382
g05      5126.498     5126.496714   5126.497     5218.729114     5128.881     5502.410392
g06      -6961.814    -6961.814     -6961.814    -6367.575424    -6875.940    -2236.950336
g07      24.306       24.306        24.307       104.599221      24.374       1120.541494     24.642
g08      0.095825     0.095825      0.095825     0.091292        0.095825     0.027188
g09      680.63       680.6300      680.630      692.472322      680.656      839.782911
g10      7049.25      7049.248021   7054.316     8442.656946     7559.192     15580.370333    8835.655
g11      0.75         0.749         0.750        0.761823        0.750        0.870984        0.750
g12      1            1             1            1               1            1               1
g13      0.053950     0.053866      0.053957     0.747227        0.057006     2.259875        0.216915


With respect to the three state-of-the-art approaches, some facts require discussion.
With respect to the Homomorphous Maps [16], our approach obtained a better “best”
solution in nine problems (g01, g02, g03, g05, g06, g07, g09, g10 and g12) and similar
“best” results in three others (g04, g08 and g11). Also, CHDE provided a better “mean”
result in five problems (g01, g05, g06, g08 and g12) and a better “worst” result for two
problems (g05 and g12). It is clear that CHDE was superior to the Homomorphous Maps
in quality of results and was competitive based on statistical measures.








Table 5. Comparison of our approach (CHDE) with respect to the Adaptive Segregational
Constraint Handling Evolutionary Algorithm (ASCHEA). NA = Not Available.

                      Best Result                Mean Result                  Worst Result
Problem  Optimal      CHDE          ASCHEA       CHDE            ASCHEA       CHDE            ASCHEA
g01      -15          -15.000       -15.0        -14.792134      -14.84       -12.743044      NA
g02      0.803619     0.803619      0.785        0.746236        0.59         0.302179        NA
g03      1            1.00          1.0          0.640326        0.99989      0.029601        NA
g04      -30665.539   -30665.539    -30665.5     -30592.154435   -30665.5     -29986.214382   NA
g05      5126.498     5126.496714   5126.5       5218.729114     5141.65      5502.410392     NA
g06      -6961.814    -6961.814     -6961.81     -6367.575424    -6961.81     -2236.950336    NA
g07      24.306       24.306        24.3323      104.599221      24.66        1120.541494     NA
g08      0.095825     0.095825      0.095825     0.091292        0.095825     0.027188        NA
g09      680.63       680.6300      680.630      692.472322      680.641      839.782911      NA
g10      7049.25      7049.248021   7061.13      8442.656946     7193.11      15580.370333    NA
g11      0.75         0.749         0.75         0.761823        0.75         0.870984        NA
g12      1            1             NA           1               NA           1               NA
g13      0.053950     0.053866      NA           0.747227        NA           2.259875        NA


With respect to the Stochastic Ranking [9], CHDE was able to find a better “best”
result in three problems (g02, g07 and g10) and a similar “best” result in the remaining
ten problems (g01, g03, g04, g05, g06, g08, g09, g11, g12 and g13). Besides these, our
approach got a similar “mean” and “worst” result for problem g12. CHDE found results of
similar or better quality than those of the Stochastic Ranking, which is one of the most
competitive approaches for evolutionary constrained optimization. However, SR is still
more robust than CHDE. This is because SR has a good mechanism to maintain diversity
in the population (keeping both feasible and infeasible solutions during the whole process).

With respect to the Adaptive Segregational Constraint Handling Evolutionary Algorithm
(ASCHEA) [18], our approach found better “best” results in three problems
(g02, g07 and g10) and a similar “best” result in eight functions (g01, g03, g04, g05, g06,
g08, g09 and g11). Finally, CHDE could find a better “mean” result in problem g02.
Our approach showed a competitive performance based on quality and showed some
robustness compared to ASCHEA. However, the analysis was incomplete because the
worst results found by ASCHEA were not available.

From the previous comparison, we can see that the CHDE produced competitive
results based on quality with respect to three techniques representative of the
state-of-the-art in constrained optimization. CHDE can deal with highly constrained problems,
problems with low (g06 and g08) and high (g01, g02, g03, g07) dimensionality, with
different types of combined constraints (linear, nonlinear, equality and inequality) and
with very large (g02), very small (g05, g13) or even disjoint (g12) feasible regions.
However, our approach presented some robustness problems and more work is required
in that direction.

It is worth emphasizing that CHDE does not require additional parameters. In contrast,
the Homomorphous Maps require an additional parameter (called v) which has
to be found empirically [16]. Stochastic Ranking requires the definition of a parameter
called $P_f$, whose value has an important impact on the performance of the approach [9].
ASCHEA also requires the definition of several extra parameters, and in its latest version,
it uses niching, which is a process that also has at least one additional parameter [18].

In terms of computational cost, the number of fitness function evaluations (FFE)
performed by our approach is lower than that of the other techniques with which it was
compared. Our approach performed 348,000 FFE. Stochastic Ranking performed 350,000
FFE, the Homomorphous Maps performed 1,400,000 FFE, and ASCHEA performed
1,500,000 FFE.
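
The figure of 348,000 evaluations follows directly from the parameters reported in Section 5 (NP = 60, MAX_GENERATIONS = 5,800), assuming one evaluation per individual per generation and not counting the evaluation of the initial population. The ratios below are derived from the numbers quoted above and are included only to make the comparison explicit.

```latex
\[
\text{FFE}_{\text{CHDE}} = NP \times \text{MAX\_GENERATIONS} = 60 \times 5{,}800 = 348{,}000
\]
\[
\frac{350{,}000}{348{,}000} \approx 1.006, \qquad
\frac{1{,}400{,}000}{348{,}000} \approx 4.02, \qquad
\frac{1{,}500{,}000}{348{,}000} \approx 4.31
\]
```

In other words, the budget of CHDE is essentially the same as that of Stochastic Ranking and roughly a quarter of that used by the Homomorphous Maps and ASCHEA.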

7 Conclusions and Future Work

A novel approach based on the simplest version of the Differential Evolution algorithm,
coupled with three simple criteria based on feasibility (CHDE), was proposed to solve
constrained optimization problems. CHDE does not require a penalty function or any
extra parameters (other than the original parameters of the DE algorithm) to bias the
search towards the feasible region of a problem. Additionally, this approach
has a low computational cost and is easy to implement. Our algorithm was compared
against three state-of-the-art techniques and it provided a competitive performance. Our
future work consists of adding a diversity mechanism which does not increase its
computational cost [8] in order to avoid premature convergence.
Abstract. This work explores symbolic regression problems by means of Genetic
Programming. Symbolic regression is a widely used method for mathematical
function approximation. Previous works based on Genetic Programming have
already dealt with this problem, but using Koza’s GP approach. This paper
introduces a novel GP encoding based on multi-branches. In order to show the use
of the proposed multi-branches representation, a set of testing equations has been
selected. The results presented in this paper show the advantages of using this novel
multi-branches version of GP.



1 Introduction 

Genetic Programming (GP) is an evolutionary paradigm that has been used for
solving a wide range of problems belonging to a variety of domains. Some examples
of these applications are robot control, pattern recognition, symbolic regression,
generation of art (music and visual arts) [18] and circuit design, amongst others. The
function approximation problem is based on a symbolic regression process. Symbolic
regression by means of genetic programming has already been studied in previous
works [7, 22]. However, this paper introduces a multi-branches encoding scheme for
GP in order to deal with this sort of function approximation. The multi-branches
encoding has been successfully applied to predicting climatological variables [16]
and designing combinatorial circuits [19].

The aim of this work is then to approximate a set of equations by means of genetic 
programming with a multi-branches encoding proposed in this paper. 

The structure of this work is as follows. Section 2 presents a review of different
proposals for representing executable structures. A detailed description of our
multi-branches genetic programming approach is given in section 3. An important concept
in genetic programming encoding is that of introns, which is described in section
4. The problem definition is stated in section 5 and section 6 contains the experimental
details. Finally, section 7 gives the analysis of results, discussion and conclusions.



R. Monroy et al. (Eds.): MICAI 2004, LNAI 2972, pp. 717-726, 2004. 
© Springer-Verlag Berlin Heidelberg 2004 

2 Representation of Executable Structure

An important aspect of the performance of evolutionary algorithms is the individual
representation. Angeline [1] mentioned that representation in Evolutionary
Algorithms plays an important role in achieving a successful search.

The work by Cramer in 1985 introduced the representation of computer programs.
Cramer represented individuals by means of strings of constant size. However,
representations of fixed size reduce the flexibility and applicability of these
implementations. In 1992, Koza introduced the Genetic Programming paradigm, which
uses representations of variable size: hierarchical tree structures. In Koza’s proposal
these are S-expressions (Koza described GP based on the LISP programming language).
This GP tree representation requires a set of primitives. Selecting adequate
primitives determines the efficiency of this encoding [1].

New representation proposals followed Koza’s initial work. One of these was the
use of modular structures (sub-routines). In this area, Angeline and Pollack [2]
introduced GLIB. GLIB used specialized operators dedicated to module
acquisition. These are mutation operators (compression and expansion) which
modify the individual structures (genotypes) but not the individual evaluation
(phenotypes). The GLIB compression operator works by substituting a selected
branch by a function. That is, the set of nodes contained in the selected branch composes
a module which is defined as a new function. The GLIB expansion operator is the
inverse process. A selected acquired module is substituted in the individual structure
by the set of nodes contained in the selected module. Then, the expanded module is
deleted from the set of functions.

An approach similar to GLIB is that of ADFs (Automatically Defined
Functions). The aim of ADFs is to protect branches with an important genetic value
from destruction by the genetic operators, mutation and recombination [10]. Rosca
and Ballard [20] have also proposed an Adaptive Representation which defines
heuristics for detecting useful branches.

In some cases, alternative representations have been proposed for solving specific
problems. In symbolic regression, an alternative is to use a representation by means of
different types of polynomials. An example of this sort of representation is the
GMDH (Group Method of Data Handling) proposed by Ivakhnenko [6], which uses a
network of transfer polynomials for pruning a layer and in which the outputs of a previous
layer are the inputs of subsequent ones. Nikolaev and Iba [15] introduced the
Accelerated Genetic Programming of Polynomials, which also considers the use of
transfer polynomials combined with a recursive least squares algorithm. The
polynomials used in these approaches are discrete versions of the Volterra series known
as Gabor-Kolmogorov polynomials. Other problems have also been solved by using
genetic programming and polynomials, as is the case for modelling chemical processes
[5]. Using polynomials has the advantage that the associated coefficients can be
estimated in various ways. In Koza’s proposal [9], coefficients and constant values are
obtained by means of the evolution process. Other GP versions compute coefficients by
means of Least Squares algorithms [15] [17].

The multi-branches representation presented in this paper also uses
polynomials. This approach searches for a solution by dividing the problem into
sub-problems (branches) and solving each of these sub-problems individually.
Once the sub-problems are solved, the partial solutions of the branches are
integrated in order to obtain a global solution of the problem [19].



3 Multi-branches GP Representation 

The multi-branches (MB) representation used in this work consists of four
parts: a root node, N branches, N+1 coefficients and an output. There are N+1
coefficients: one for each branch plus the constant term. Figure 1 shows a
diagram of this multi-branches structure.

The two main genetic operators are crossover and mutation. Crossover consists
of selecting a pair of parent structures; a branch is then randomly selected in
each parent, and finally the selected branches are exchanged between them. The
mutation operator randomly selects an individual; a branch is then selected,
deleted and replaced by a new, randomly generated branch. The mutation and
crossover processes are shown graphically in Figures 2 and 3, respectively.

Fig. 2. Mutation operator

Fig. 3. Crossover operator: i) Branches selection and ii) Exchange of branches.

The GP parameters used in this version of genetic programming are similar to
the ones used in Koza’s GP approach, except for the parameters regarding the
maximum number of nodes (or maximum tree depth or size). In this MB version,
the size of the individual structure (expression) is bounded by the number of
branches and the maximum depth of these branches.
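
The branch-level operators described in this section can be sketched as
follows. This is a minimal illustration rather than the authors' implementation:
an individual is assumed to be a plain list of branches, each branch is a nested
tuple over a toy primitive set, and the names random_branch, crossover and
mutate are hypothetical.

```python
import random


def random_branch(rng, depth=2):
    """Grow a small random expression tree (toy primitive set, assumed)."""
    if depth == 0 or rng.random() < 0.3:
        return rng.choice(["x", "1"])
    op = rng.choice(["+", "*"])
    return (op, random_branch(rng, depth - 1), random_branch(rng, depth - 1))


def crossover(parent_a, parent_b, rng):
    """Pick one branch in each parent at random and exchange them."""
    child_a, child_b = list(parent_a), list(parent_b)
    i = rng.randrange(len(child_a))
    j = rng.randrange(len(child_b))
    child_a[i], child_b[j] = child_b[j], child_a[i]
    return child_a, child_b


def mutate(individual, rng, depth=2):
    """Delete one randomly chosen branch and replace it with a new random one."""
    child = list(individual)
    k = rng.randrange(len(child))
    child[k] = random_branch(rng, depth)
    return child


if __name__ == "__main__":
    rng = random.Random(1)
    a = [random_branch(rng) for _ in range(6)]  # six branches, as in Table 1
    b = [random_branch(rng) for _ in range(6)]
    print(crossover(a, b, rng))
    print(mutate(a, rng))
```

Because both operators work on whole branches, the number of branches per
individual never changes, which is what keeps the expression size bounded.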



4 Introns 

In GP, growth in the size of individuals without any improvement in performance
is known as bloat [12]. Some of the identified causes of bloat are the fitness
function [12] and the presence of introns [4] [14] [23]. Introns are parts of a
chromosome that do not affect the evaluation of the individual (phenotype).

Introns are considered a sort of structure that protects part of the chromosome
from the destructive effects of genetic operators [13]. However, some
researchers argue that introns have more negative effects than beneficial ones.
The literature reports diverse mechanisms for eliminating introns: removal of
redundant code (introns) by means of an editing operator [9], penalization of
individual size [14][26], constrained operators [11][25] and alternative
selection schemes [4].

Comparing Koza’s GP version with GP with multi-branches [16] for symbolic
regression of climatological data, it was observed that GP with multiple
branches produced solutions with better precision and lower complexity [17].
The improvement in complexity arises because GP with multi-branches takes
advantage of the definition of introns, as will be explained later on.



5 Problem Statement 

Fitting data points (curves) can be performed by means of a combination of known
functions. The function approximation g(x) is obtained by estimating a set of
coefficients $a_i$, one for each of the n known functions $f_i(x)$, together
with $a_0$ (the independent coefficient), as given by equation (1).

$$g(x) = a_0 + a_1 f_1(x) + \cdots + a_n f_n(x) = a_0 + \sum_{i=1}^{n} a_i f_i(x) \qquad (1)$$
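
As an illustration of how the coefficients in equation (1) can be estimated once
the functions are fixed, the following sketch solves the least squares normal
equations in plain Python. It is a minimal example under stated assumptions, not
the authors' code: the helper names solve_linear and fit_coefficients are
hypothetical, and the functions are assumed to be linearly independent on the
sample points (otherwise the system is singular).

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def fit_coefficients(functions, xs, ys):
    """Return [a_0, a_1, ..., a_n] minimizing the squared error of equation (1)."""
    rows = [[1.0] + [f(x) for f in functions] for x in xs]  # leading 1 gives a_0
    k = len(functions) + 1
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    aty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(k)]
    return solve_linear(ata, aty)


if __name__ == "__main__":
    xs = [i / 10 for i in range(-10, 11)]
    ys = [2.0 + 3.0 * x - 0.5 * x * x for x in xs]
    print(fit_coefficients([lambda x: x, lambda x: x * x], xs, ys))
```

With exact data generated from 2 + 3x - 0.5x^2, the fitted coefficients agree
with 2, 3 and -0.5 up to rounding error.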

Determining a priori which set of functions is the most adequate is not an easy
task. The set of known functions can be large, the value of some coefficients
can be so small that the associated function is insignificant, and estimating
coefficients for all possible functions is computationally expensive. A better
approximation method should consider only the most significant functions and
coefficients, and its computational cost must be low. Thus, the central point is
to find the functions p(x) and associated coefficients that are the most
significant in order to better approximate a curve, as shown in equation (2).

$$Q = \{\, q : q \text{ is a known function} \,\}$$

$$P = \{\, p : p \in Q,\ p \text{ is significant for the approximation to } g(x) \,\}$$

$$\forall\, p_i(x) \in P,\ m = |P|: \quad g(x) = b_0 + b_1 p_1(x) + \cdots + b_m p_m(x) = b_0 + \sum_{i=1}^{m} b_i p_i(x) \qquad (2)$$

In genetic programming, a given finite set of primitives (functions and
terminals or arguments) bounds the problem of function approximation. The set of
primitives can then build a wide range of functions. The only restriction in GP
is the maximum size of the hierarchical tree structure.



6 Experiments 

A set of equations was considered in order to cover a wide range of symbolic
regression (function approximation) problems. These equations were based on
Keijzer’s work [7]. The function set used in all these approximations is defined
in Table 1. Following Keijzer’s study, a maximum number of evaluations was
defined. Given this parameter, the maximum number of generations and the
population size can vary, provided the number of evaluations is not exceeded.
For each equation, a different domain range is specified, as described in
Table 2.

The set of functions for testing this multi-branches genetic programming
approach is given as follows:



[8]   $f(x) = 0.3\,x \sin(2\pi x)$   (3)

[21]  $f(x) = x^3 e^{-x} \cos(x) \sin(x) \left(\sin^2(x) \cos(x) - 1\right)$   (4)

[7]   $f(x,y,z) = \dfrac{30xz}{(x-10)\,y^2}$   (5)

[22]  $f(x) = \sum_{i=1}^{x} 1/i$   (6)

[22]  $f(x) = \log(x)$   (7)

[22]  $f(x) = \sqrt{x}$   (8)

[22]  $f(x) = \operatorname{arcsinh}(x)$   (9)

[24]  $f(x,y) = x^y$   (10)

[24]  $f(x,y) = xy + \sin((x-1)(y-1))$   (11)

[24]  $f(x,y) = x^4 - x^3 + y^2/2 - y$   (12)

[24]  $f(x,y) = 6 \sin(x) \cos(y)$   (13)

[24]  $f(x,y) = 8/(2 + x^2 + y^2)$   (14)

[24]  $f(x,y) = x^3/5 + y^3/2 - y - x$   (15)



Table 1. Parameter settings for genetic programming with multi-branches.

Parameters               Values
Function set             { x + y, x * y, 1/x, -x, sqrt(x) }
Number of evaluations    10000
Maximum genome size      400
Population size          50
% of Crossover           95
% of Mutation            15
Number of branches       6







Table 2. Problem settings.

Problem  Equation  Range train                    Range test               Note
1        3         x=[-1:0.1:1]                   [-1:0.001:1]             25000 evals
2        3         x=[-2:0.1:2]                   [-2:0.001:2]             25000 evals
3        3         x=[-3:0.1:3]                   [-3:0.001:3]
4        4         x=[0:0.05:10]                  [0.5:0.05:10.5]          {exp, log, sin, cos}, 60000 evals
5        5         x,z = rnd(-1,1); y = rnd(1,2)  idem; idem               train 1000 cases, test 10000 cases
6        6         x=[1:1:50]                     [1:1:120]                extrapolation
7        7         x=[1:1:100]                    [1:0.1:100]
8        8         x=[0:1:100]                    [0:0.1:100]
9        9         x=[0:1:100]                    [0:0.1:100]
10       10        x,y = rnd(0,1)                 x,y=mesh([0:0.01:1])     train 100 cases
11       11        x,y = rnd(-3,3)                x,y=mesh([-3:0.01:3])    train 20 cases
12       12        x,y = rnd(-3,3)                x,y=mesh([-3:0.01:3])    train 20 cases
13       13        x,y = rnd(-3,3)                x,y=mesh([-3:0.01:3])    train 20 cases
14       14        x,y = rnd(-3,3)                x,y=mesh([-3:0.01:3])    train 20 cases
15       15        x,y = rnd(-3,3)                x,y=mesh([-3:0.01:3])    train 20 cases



7 Result Analysis, Discussion, and Conclusions

7.1 Results 

Table 3 shows the normalized root mean squared error (NRMS) of the results from
previous work [7] and of multi-branches genetic programming on the same set of
functions. MB genetic programming is compared to Keijzer’s approach, GP with
scaling and interval definition. Note that the same set of functions is used in
both cases.

The NRMS value is computed by means of equation (16), where N is the number of
fitness cases (data points), MSE is the mean squared error and $\sigma_t$ is the
standard deviation of the target values. The best individual for each problem is
then evaluated using testing data. Testing performance results are presented in
Table 4. Finally, Table 5 shows the average complexity and the best solution
complexity of each run for the genetic programming with multi-branches approach.
Complexity is measured here as the number of nodes contained in evolved
solutions, including both function nodes and terminal nodes. The amount of
destructive overfitting is as defined by Keijzer [7].



$$\mathrm{NRMS} = 100 \cdot \frac{\sqrt{\frac{N}{N-1}\,\mathrm{MSE}}}{\sigma_t} \qquad (16)$$
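
The NRMS measure of equation (16) can be computed as in the sketch below. This
is an illustrative helper, not code from the paper: the function name nrms is
hypothetical, and the standard deviation of the targets is assumed to be the
sample standard deviation (denominator N - 1). Under that assumption a model
that always predicts the mean of the targets scores 100, and a perfect model
scores 0.

```python
import math


def nrms(targets, predictions):
    """Normalized root mean squared error, in percent (equation (16))."""
    n = len(targets)
    mse = sum((t - p) ** 2 for t, p in zip(targets, predictions)) / n
    mean_t = sum(targets) / n
    # Assumption: sample standard deviation of the target values.
    sigma_t = math.sqrt(sum((t - mean_t) ** 2 for t in targets) / (n - 1))
    return 100.0 * math.sqrt(n / (n - 1) * mse) / sigma_t


if __name__ == "__main__":
    targets = [1.0, 2.0, 4.0, 7.0]
    print(nrms(targets, targets))                # perfect fit
    print(nrms(targets, [3.5, 3.5, 3.5, 3.5]))   # constant mean predictor
```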



Table 3. Main training information of the best of run individuals produced by each version

Problem              1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
no interval         20  54  76  22   2   2   2   2   2  11  12   4  30  15  10
no scaling          50  78  88  46  22  49  50  56  93  54  19   4  63  91  23
interval + scaling   8  34  62  15   1   1   1   1   1   7  11   1  25   1   7
multi branches       2  24  85  19   1   0   0   1   0   3   8   7  19  13   1



Table 4. Amount of destructive overfitting for each of the problems 



Problem 


1


2 


3 


4 5 6 7 8 9 10 


11 


12 


13 


14 


15


no interval 


6 


11 


25 


1 49 


31 


30 


36 


23 


40 


no scaling 






2 






1 


15 




1 


interval + scaling 












1 


3 




1 


Multi branches 








2 


194 






0 





Table 5. Mean and the best solution complexity for each problem

Problem          1   2    3   4   5   6   7   8   9  10  11  12  13  14  15
Mean            62  77   61  39  40  39  38  54  39  44  39  37  43  40  30
Best solution   74  83  124  39  58  42  48  83  40  58  53  26  47  48  36



8 Discussion

In terms of performance (NRMS), the results generated by genetic programming
with multi-branches were better than those obtained by GP without scaling or
without intervals, as presented by Keijzer [7]. In the case of GP + scaling +
interval, the results were similar, but multi-branches GP gave slightly better
performance in many cases. However, there were three cases (equations (3), (4)
and (14)) where the combined use of scaling + interval exhibited better
performance than multi-branches GP. In the case of equation (3), error values
were also high when using scaling + interval. Regarding overfitting values,
observations are quite similar for both methods, as mentioned previously. It is
important to mention that the computational cost required to obtain these
results was lower in the present paper. The number of evaluations considered for
multi-branches genetic programming was 10000, except for the cases indicated in
Table 2, where the number of evaluations coincided with previous work (25000
evaluations). In terms of maximum genome size, multi-branches GP used a maximum
of 400 nodes, whereas 1024 were used for the other approaches. Table 5 shows
that the number of nodes of the best solutions was in the range [26, 124].

Introns can be classified into four groups: hierarchical, horizontal,
asymptotical and incremental fitness [4]. Genetic programming with
multi-branches presents a kind of intron that does not fall into any of these
four groups. This new type of intron shows a beneficial effect during evolution.
An intron can occur when two or more branches produce the same partial output.
The coefficients of branches that produce the same partial output are zero,
except for one whose value is computed by means of a least squares algorithm. If
these similar branches belong to individuals of higher fitness, they are kept;
otherwise they are destroyed during the evolutionary process. Another type of
intron in the multi-branches representation consists of the branches whose
coefficient is zero. These branches do not contribute to the final solution, so
complexity can be reduced.



9 Conclusions 

The multi-branches representation for genetic programming introduced in this
paper performed well on the function approximation problems on which it was
tested, and the results are promising. It was also observed that computational
cost tends to be reduced by using this representation. It is also relevant to
note that introns in the multi-branches representation are easily detected and
show beneficial effects.

Further studies will focus on both the flexibility of this representation in
diverse domains and the effects and control of introns.



Abstract. A preprocessing procedure is applied to the traveling salesman
problem. It uses a local guided search, defined in terms of a neighborhood
structure, to obtain a feasible solution (UB), together with the Osorio and
Glover [18], [20] approach of exploiting surrogate constraints and constraint
pairing. The surrogate constraint is obtained by weighting the original
problem constraints by their associated dual values in the linear
relaxation of the problem. The objective function is turned into a constraint
by requiring it to be less than or equal to the value of a feasible solution
(UB). The surrogate constraint is paired with this constraint to obtain a
combined constraint in which variables with negative coefficients are replaced
by complemented variables, and the resulting constraint is used to fix
variables to zero or one before solving the problem.

Keywords: TSP, Surrogate Constraint Analysis, Preprocessing.



1 Introduction 

The TSP has received great attention from the operations research and computer
science communities because it is very easy to describe but very hard to solve
[2]. The problem can be formulated by saying that the traveling salesman must
visit every city in his territory exactly once and then return to the starting
point. Given the cost of travel between all cities, he should plan his itinerary
so as to minimize the total cost of the entire tour.

The solution space of the TSP is the set of permutations of the n cities, of
size n!. Each permutation is a different solution. The optimum is the
permutation that corresponds to a tour with minimum cost. The evaluation
function is very simple, because we only need to add the costs associated with
each segment in the itinerary to obtain the total cost of that itinerary.
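
The evaluation function and the size of the solution space can be illustrated
with a short sketch. The distance matrix below is hypothetical, the function
names tour_cost and brute_force_tsp are illustrative, and the exhaustive search
is only practical for a handful of cities because it enumerates (n - 1)! tours
with the start city fixed.

```python
from itertools import permutations


def tour_cost(tour, dist):
    """Sum the cost of each segment, including the return to the start."""
    n = len(tour)
    return sum(dist[tour[i]][tour[(i + 1) % n]] for i in range(n))


def brute_force_tsp(dist):
    """Enumerate every tour that starts at city 0 and keep the cheapest."""
    n = len(dist)
    candidates = ((0,) + p for p in permutations(range(1, n)))
    best = min(candidates, key=lambda t: tour_cost(t, dist))
    return best, tour_cost(best, dist)


if __name__ == "__main__":
    # Hypothetical symmetric 4-city instance.
    dist = [
        [0, 2, 9, 10],
        [2, 0, 6, 4],
        [9, 6, 0, 3],
        [10, 4, 3, 0],
    ]
    print(brute_force_tsp(dist))
```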

The TSP is a relatively old problem. It was already documented in 1759,
under a different name, by Euler. The term ‘traveling salesman’ was first used
in 1932 in a German book written by a veteran traveling salesman. The TSP, in
the way we know it now, was introduced by the RAND Corporation in 1948. The
Corporation’s reputation helped to make the TSP a well known and popular
problem. The TSP also became popular at that time due to the emergence of
linear programming and the attempts to solve combinatorial problems.

In 1979, it was shown that the TSP is NP-hard (see Garey et al. [1]): every
problem in NP can be reduced to it in polynomial time. This means that if one
could solve such a problem in polynomial time with a deterministic procedure,
one could do so for all problems in NP, and then P=NP. Nobody has been able to
find efficient algorithms for NP-complete problems until now, and nobody has
demonstrated that such algorithms do not exist.

The TSP can be symmetric or asymmetric. In the symmetric case, departure 
and return costs are the same and can be represented with an undirected graph. 
For the asymmetric case, the more common one, the departure and return costs 
are different and can only be represented by a directed graph. Because the sym- 
metric problem is usually harder than the asymmetric one, this research was 
directed to the symmetric case. 

The TSP has become a classic problem because it serves to represent a great
number of real-life applications, such as the coloring sequence in the textile
industry, the design of insulating material and optical filters, the printing of
electronic circuits, the planning of trajectories in robotics and many other
examples that can be represented using sequences (see Salkin [21]). Besides, it
can represent a large number of combinatorial problems that are NP-hard and for
which no polynomial-time algorithm is known.

The exponential nature of the time needed to solve this problem exactly has
led, during the last decades, to the development of heuristic algorithms to
approximate its optimal solution (see Gass [2]).

The present paper is structured as follows. Section 2 presents the integer
programming formulation of the TSP. Section 3 describes the dual surrogate
constraint, and Section 4 the paired constraint. Section 5 presents an example
solved with our approach. Section 6 shows experimental results, and Section 7
gives the conclusions.



2 Integer Programming Formulation 

As mentioned before, a traveling salesman must visit n cities, each exactly
once. The distance between every pair of cities i and j, denoted by $d_{ij}$
$(i \neq j)$, is known and may depend on the direction traveled (i.e., $d_{ij}$
does not necessarily equal $d_{ji}$). The problem is to find a tour which
commences and terminates at the salesman’s home city and minimizes the total
distance traveled.

Suppose we label the home city as city 0 and as city n + 1. (Then we may
think of the salesman’s initial location as city 0 and the desired final
location as city n + 1.) Also, introduce the zero-one variables $x_{ij}$
$(i = 0, 1, \ldots, n;\ j = 1, \ldots, n+1;\ i \neq j)$, where $x_{ij} = 1$ if
the salesman travels from city i to j, and $x_{ij} = 0$ otherwise. To guarantee
that each city (except city 0) is entered exactly once, we have
$\sum_{i=0}^{n} x_{ij} = 1$ $(j = 1, \ldots, n+1;\ i \neq j)$.
Similarly, to ensure that each city (except city n + 1) is left exactly once,
we have $\sum_{j=1}^{n+1} x_{ij} = 1$ $(i = 0, \ldots, n;\ i \neq j)$. These
constraints, however, do not eliminate the possibility of subtours or “loops”.
One way of eliminating the subtour possibility is to add the constraints
$\alpha_i - \alpha_j + (n+1) x_{ij} \le n$
$(i = 0, \ldots, n;\ j = 1, \ldots, n+1;\ i \neq j)$,
where $\alpha_i$ is a real number associated with city i. To complete the model
we should minimize the total distance between the cities. An integer
programming formulation of the traveling salesman problem is to find variables
$x_{ij}$ and arbitrary real numbers $\alpha_i$ which:

Minimize    $\sum_{i=0}^{n} \sum_{j=1,\, j \neq i}^{n+1} d_{ij} x_{ij}$

Subject to  $\sum_{i=0}^{n} x_{ij} = 1 \quad (j = 1, \ldots, n+1;\ i \neq j)$
            $\sum_{j=1}^{n+1} x_{ij} = 1 \quad (i = 0, \ldots, n;\ i \neq j)$
            $\alpha_i - \alpha_j + (n+1) x_{ij} \le n \quad (i = 0, \ldots, n;\ j = 1, \ldots, n+1;\ i \neq j)$
            $\alpha_i \ge 0 \quad (i = 0, \ldots, n+1)$
            $x_{ij} \in \{0, 1\} \quad (i = 0, \ldots, n;\ j = 1, \ldots, n+1;\ i \neq j)$

where $x_{0,n+1} = 0$ (since $x_{ij} = 0$ for i = j). This formulation
originally appeared in Tucker [22]. It avoids subtours successfully, but
considerably enlarges the model, which now has $(n+1)^2 + 2$ variables, of which
$(n+1)^2 - n$ are binary, and $(n+2) + (n+1)^2$ constraints.
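
The size of this model can be checked by enumerating its variables and
constraints directly. The sketch below is only a counting aid under the
indexing used above (cities 0 to n + 1, arcs from i = 0..n to j = 1..n+1 with
i different from j); the function names build_tsp_model and model_size are
hypothetical, and no solver is involved.

```python
def build_tsp_model(n):
    """Enumerate the variables and constraint index sets of the formulation."""
    arcs = [(i, j) for i in range(n + 1) for j in range(1, n + 2) if i != j]
    return {
        "binaries": arcs,                       # one x_ij per admissible arc
        "alphas": list(range(n + 2)),           # one real alpha_i per city 0..n+1
        "enter": [[a for a in arcs if a[1] == j] for j in range(1, n + 2)],
        "leave": [[a for a in arcs if a[0] == i] for i in range(n + 1)],
        "subtour": list(arcs),                  # one elimination row per arc
    }


def model_size(n):
    """Return (number of binaries, number of variables, number of constraints)."""
    m = build_tsp_model(n)
    variables = len(m["binaries"]) + len(m["alphas"])
    constraints = len(m["enter"]) + len(m["leave"]) + len(m["subtour"])
    return len(m["binaries"]), variables, constraints


if __name__ == "__main__":
    for n in (3, 10, 20):
        print(n, model_size(n))
```

For n = 3 this gives 13 binary variables, 18 variables in total and 21
constraints, in agreement with the closed-form counts.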

3 Dual Surrogate 

As defined by Glover [4], a surrogate constraint is an inequality implied by the
constraints of an integer program and designed to capture useful information
that cannot be extracted from the parent constraints individually, but which
is nevertheless a consequence of their conjunction. The integer program can be
written as:

Minimize    $cx$

Subject to  $Ax \le b$, $0 \le x \le e$, and $x$ integer

Since $Ax \le b$ implies $b - Ax \ge 0$, we have for a nonnegative weighting
vector u that $u(b - Ax) \ge 0$ is a surrogate constraint. A value of u is
selected which satisfies a most useful or “strongest” surrogate constraint
definition as given in [4], [5]. It has been shown by Glover [5] that the
weighting vector u in a strongest surrogate constraint consists of the optimal
values of the dual variables of the corresponding LP relaxation.
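
Forming a surrogate constraint from a given weighting vector is a simple
weighted sum of rows, as the following sketch shows. The data are hypothetical,
the function name surrogate is illustrative, and the weights are supplied by
hand here rather than taken from an LP solver.

```python
def surrogate(a, b, u):
    """Combine the rows of A x <= b with nonnegative weights u.

    Returns (coeffs, rhs) such that sum(coeffs[l] * x[l]) <= rhs holds for
    every x that satisfies all the original constraints.
    """
    if any(w < 0 for w in u):
        raise ValueError("weights must be nonnegative for <= constraints")
    rows, cols = len(a), len(a[0])
    coeffs = [sum(u[k] * a[k][l] for k in range(rows)) for l in range(cols)]
    rhs = sum(u[k] * b[k] for k in range(rows))
    return coeffs, rhs


if __name__ == "__main__":
    a = [[1, 2, 0],
         [0, 1, 3]]
    b = [4, 6]
    u = [2, 1]  # hypothetical weights
    print(surrogate(a, b, u))
```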

Optimality conditions for surrogate duality are the requirements that the
surrogate multiplier vector u is nonnegative, x is optimal for the surrogate
problem, and x is feasible for the primal problem. “Strong” optimality
conditions add the requirement of complementary slackness. A complete
derivation of this theory can be found in Glover [5]. The methodology proposed
here relies on these fundamental results.



4 Paired Constraint

The main ideas about constraint pairing in integer programming were presented
by Hammer et al. [9]. With the objective of obtaining bounds for most variables,
the strategy is to pair constraints in the original problem to produce bounds
for some variables.

Based on the results presented for surrogate constraints, the dual surrogate
constraint provides the most useful relaxation of the constraint set, and can be
paired with the objective function. If we let $K = (n+1)^2 + (n+2)$ be the total
number of constraints and $L = (n+1)^2 + 2$ the total number of variables, the
resulting surrogate is:

$$\sum_{k=1}^{K} u_k \left( \sum_{l=1}^{L} a_{kl} z_l \right) \le \sum_{k=1}^{K} u_k b_k \qquad (1)$$

where $u_k$ is the dual value of the kth constraint, $a_{kl}$ the coefficient in
row k and column l, $z_l$ the lth variable (it may be $x_{ij}$ or $\alpha_i$),
and $b_k$ the kth right-hand side. Now, we define

$$s_l = \sum_{k=1}^{K} u_k a_{kl}, \quad l = 1, \ldots, L \qquad (2)$$

In addition, we require the objective function to be less than or equal to the
value of a known feasible integer solution (UB). This integer solution was
obtained using a guided local search defined in terms of a neighborhood
structure, where tour B is a neighbor of tour A if it can be obtained from A by
a specific type of perturbation or move. It takes negligible CPU time to get a
feasible tour with this procedure [14].

The paired constraint between the surrogate and the objective function is

$$\sum_{l=1}^{L} (c_l - s_l)\, z_l \le UB - \sum_{k=1}^{K} u_k b_k \qquad (3)$$

To be able to use constraint (3) to fix variables at both bounds, all
coefficients must be positive or zero. We substitute $y_l = 1 - z_l$ for the
variables with negative coefficients $(c_l - s_l)$ to get positive ones
$(c_l - s_l)'$, and add the equivalent value to the right-hand side. The
right-hand side of the surrogate is the LP optimal solution (LB), and the
right-hand side of the paired constraint becomes the difference between the best
known solution, the upper bound (UB), and the LP solution, the lower bound (LB).
The resulting paired constraint, used to fix variables to zero or one, is

$$\sum_{l:\,(c_l - s_l) > 0} (c_l - s_l)\, z_l + \sum_{l:\,(c_l - s_l) < 0} (c_l - s_l)'\, y_l \le UB - LB \qquad (4)$$

If the coefficient $(c_l - s_l)$ of $z_l$ is greater than the difference
(UB - LB), that variable must be zero in the integer solution; if the
coefficient $(c_l - s_l)'$ of $y_l$ is greater than the same difference, that
variable must be one in the integer solution, because its complement $y_l$ must
be zero. Variables whose coefficients are smaller than the difference remain in
the problem. Because we depend on the gap UB - LB, and LB cannot be changed
because it is the optimal value of the continuous LP relaxation of the problem,
a better UB, given by the best integer solution known, can increase the number
of integer variables fixed.
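
The fixing rule can be written in a few lines. The following sketch uses
hypothetical data and a hypothetical function name, fix_variables; it assumes
the objective coefficients c_l, the surrogate coefficients s_l and the two
bounds are already available.

```python
def fix_variables(c, s, ub, lb):
    """Apply the paired-constraint test to every binary variable.

    Returns a dict {index: fixed value}. A variable whose coefficient
    (c_l - s_l) exceeds the gap is fixed to 0; a variable whose complemented
    coefficient -(c_l - s_l) exceeds the gap is fixed to 1.
    """
    gap = ub - lb
    fixed = {}
    for l, (cl, sl) in enumerate(zip(c, s)):
        d = cl - sl
        if d > gap:
            fixed[l] = 0      # z_l = 1 would already exceed UB - LB
        elif -d > gap:
            fixed[l] = 1      # the complement y_l = 1 - z_l must be 0
    return fixed


if __name__ == "__main__":
    # Hypothetical coefficients and bounds, for illustration only.
    c = [10, 4, 7, 1]
    s = [2, 4, 9, 8]
    print(fix_variables(c, s, ub=20, lb=15))
```

A smaller gap fixes more variables: rerunning the example with a tighter upper
bound can only enlarge the returned dictionary.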



5 Example 



We illustrate the procedure with the following example. Table 1 shows the
distances for a traveling salesman problem with 3 cities.



Table 1. Distances for the 3-cities example

From/To     1     2     3
1           ∞    26    82
2         134     ∞   117
3          38    13     ∞



The Integer Programming formulation for this example is:

Minimize    $26x_{12} + 82x_{13} + 134x_{21} + 117x_{23} + 38x_{31} + 13x_{32}$

Subject to  $x_{01} + x_{02} + x_{03} + x_{04} = 1$
            $x_{12} + x_{13} + x_{14} = 1$
            $x_{21} + x_{23} + x_{24} = 1$
            $x_{31} + x_{32} + x_{34} = 1$
            $x_{01} + x_{21} + x_{31} = 1$
            $x_{02} + x_{12} + x_{32} = 1$
            $x_{03} + x_{13} + x_{23} = 1$
            $x_{04} + x_{14} + x_{24} + x_{34} = 1$
            $4x_{01} + \alpha_0 - \alpha_1 \le 3$
            $4x_{02} + \alpha_0 - \alpha_2 \le 3$
            $4x_{03} + \alpha_0 - \alpha_3 \le 3$
            $4x_{04} + \alpha_0 - \alpha_4 \le 3$
            $4x_{12} + \alpha_1 - \alpha_2 \le 3$
            $4x_{13} + \alpha_1 - \alpha_3 \le 3$
            $4x_{14} + \alpha_1 - \alpha_4 \le 3$
            $4x_{21} + \alpha_2 - \alpha_1 \le 3$
            $4x_{23} + \alpha_2 - \alpha_3 \le 3$
            $4x_{24} + \alpha_2 - \alpha_4 \le 3$
            $4x_{31} + \alpha_3 - \alpha_1 \le 3$
            $4x_{32} + \alpha_3 - \alpha_2 \le 3$
            $4x_{34} + \alpha_3 - \alpha_4 \le 3$
            $\alpha_i \ge 0 \quad (i = 0, \ldots, 4)$
            $x_{ij} \in \{0, 1\} \quad (i = 0, \ldots, 3;\ j = 1, \ldots, 4;\ i \neq j)$

The relaxed LP problem replaces $x_{ij} \in \{0, 1\}$ by $0 \le x_{ij} \le 1$.
The LP optimal solution is 64 and the dual values for the constraints (not
including the bounds) are:

$u = \{-51, 0, 0, -13, 51, 26, 51, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0\}$

The surrogate constraint, the paired constraint and the fixed variables are
shown in Table 2.



Table 2. Surrogate and Paired Constraints

             x01  x02  x03  x04  x12  x13  x14  x21  x23  x24  x31  x32  x34        RHS
c_l            0    0    0    0   26   82    0  134  117    0   38   13    0         80   UB
s_l            0  -25    0   51   26   51    0   51   51    0   38   13  -13   ≤     64   LB
c_l - s_l      0   25    0   51    0   31    0   83   66    0    0    0   13   ≤     16   UB-LB
x_ij                0         0         0         0    0                                  fixed





6 Experimental Results 

We tested our procedure with 30 symmetric instances generated with a random
exponential distribution that produces especially hard instances [19]. The
average values obtained for every set of five instances with the same number of
cities, generated with different seeds, are reported in Table 3. The problems
were solved on a Pentium III at 1066 MHz with 248 MB of RAM. To obtain the
LP solution and to solve the problem to optimality, we used ILOG CPLEX
8.0. The feasible solution used as UB was obtained with a guided local search
defined in terms of a neighborhood structure [16].



Table 3. Results for Hard Instances

Number     Best    Fixed      %Fixed     %Rel.Dif        CPU     Optimal
of Cities  Known   Variables  Variables  between Soln’s  Secs.   Solution
10         1696    57.8       51.88      7.17            0.758   1582
20         2321    174.4      37.67      20.62           7.525   1924
30         3095    205.8      20.72      58.78           97.17   1949



6.1 Hard Problem Generation for TSP 

We developed a generator that produces challenging TSP problems. Following
the ideas presented in Osorio and Glover [19], our approach uses independent
exponential distributions over a wide range to generate the distances between
the cities. For CPLEX to reach optimality, this kind of instance takes at least
10 times the number of CPU seconds and 100 times the number of nodes in the
search tree required by instances generated with a uniform random
distribution. The problem generator used to create the random TSP instances
is designed as follows. The distances between the cities, $d_{ij}$, are integer
numbers drawn from the exponential distribution
$d_{ij} = 1.0 - 1000 \ln(U(0, 1))$.
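
A generator of this kind can be sketched as follows. This is an illustration
under stated assumptions, not the authors' generator: distances are assumed to
follow d_ij = 1.0 - 1000 ln(U(0, 1)), truncation to an integer is an assumed
rounding rule, the matrix is made symmetric by drawing each pair once, and the
function name hard_symmetric_instance is hypothetical.

```python
import math
import random


def hard_symmetric_instance(n, seed=None):
    """Return an n x n symmetric integer distance matrix with a zero diagonal."""
    rng = random.Random(seed)
    dist = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            u = 1.0 - rng.random()          # uniform on (0, 1], avoids log(0)
            d = int(1.0 - 1000.0 * math.log(u))  # assumed truncation to integer
            dist[i][j] = dist[j][i] = d     # one draw per unordered pair
    return dist


if __name__ == "__main__":
    for row in hard_symmetric_instance(5, seed=42):
        print(row)
```

Using a different seed for each instance of a given size mirrors the way sets
of instances with the same number of cities are produced.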

7 Conclusions 

Our procedure is a very easy way to fix binary variables to their bounds in TSP
instances. It can be seen as an effective preprocessing step that reduces the
number of binary variables left to be fixed in a search tree. The procedure is
simple and uses a local guided search defined in terms of a neighborhood
structure to get a feasible tour, together with surrogate analysis based on
results from the solution of the LP relaxation. The results obtained show that a
percentage of the variables can be fixed in a short amount of time for many
different instances.



Abstract. This paper introduces a CAD tool, ASPIRE (Automatic
Spatial Partitioning In Reconfigurable Environments), for the spatial
partitioning problem for Multi-FPGA architectures. The tool takes as
input an HDL (Hardware Description Language) model of the application
along with user-specified constraints, and automatically generates a task
graph G, partitions G based on the user-specified constraints, and
maps the blocks of the partitions onto the different FPGAs (Field
Programmable Gate Arrays) in the given Multi-FPGA architecture, all in a
single shot. ASPIRE uses an evolutionary approach for the partitioning
step. ASPIRE handles the major part of the partitioning at the behavioral
HDL level, making it scalable to larger, complex designs. ASPIRE
was successfully employed to spatially partition a reasonably big
cryptographic application that involved a 1024-bit modular exponentiation
and to map it onto a network of nine ACEX1K based Altera
EP1K30QC208-1 FPGAs.

1 Introduction 

Functional specifications of semiconductor products change frequently in
compliance with market requirements and evolving standards. This necessitates a
hardware environment that can be programmed dynamically. Reconfigurable
systems provide a viable solution to this problem. However, such reconfigurable
systems demand expensive and tedious initialization phases that in turn entail
concepts like temporal [1] and spatial partitioning [3]. A given application
that is too large to fit into the reconfigurable system all at once is passed
through the temporal partitioning phase, which breaks the application into
temporal segments that maximize resource utilization and minimize execution
time. Each of the temporal segments is further spatially partitioned, in order
to establish a mapping of the tasks and the memory segments within a specific
temporal segment onto the available FPGAs, in a manner that enhances the
efficiency of the implemented logic. An evolutionary approach to solve the
spatial partitioning problem is presented in [3]. The concepts presented in [2],
though not directly related to the theme of this paper, provide significant
insight into the use of primitive genetic algorithms to solve the problems of
hardware-software partitioning and hardware design space exploration.

1.1 Previous Work 

The genetic search procedure was developed by John Holland [4] in 1975, and
since then it has been successfully employed to provide solutions for various
combinatorial problems in VLSI design automation [5,6]. Genetic algorithms
help solutions escape local optima, particularly where multiple constraints are
involved, as in the spatial partitioning problem solved in this paper.
One of the latest papers published in the area of spatial partitioning is by
Jose Ignacio Hidalgo [7]. The GA employed there [7] does not capture all the
constraints, e.g. memory constraints, speed of execution and routing
constraints. The crossover operators are not very effective, and the CPU time
required to obtain desirable results is large. The partitioning done in [7] is
at the net-list (cell) level, making it less scalable with increasing design
size. The algorithm developed by Iyad Ouaiss [3] considers partitioning at the
behavioral level.

1.2 Contributions of This Paper 

ASPIRE takes as input an HDL model of the application along with user-specified
constraints, and automatically generates a task graph G, partitions G based
on the user-specified constraints, and maps the blocks of the partitions onto
the different FPGAs in the given Multi-FPGA architecture, all in a single shot.
The crux of our contribution is a genetic approach to the spatial partitioning
problem for FPGAs, which ASPIRE employs in the partitioning step. One of the
most notable features of this work is that the spatial partitioning step not
only maps the different tasks in a temporal segment to the multiple FPGAs, but
also routes the pin-to-pin inter-FPGA connections such that some critical
parameters of the routing are taken care of. These parameters include
time-critical nets, routing congestion on interconnection devices and the
overall inter-FPGA signal propagation delay.



2 Overview of ASPIRE 

The ASPIRE tool follows a three-phase approach, as shown in Figure 2.1, to
meet the desired objectives. The first phase of ASPIRE verifies whether the given
module can fit onto a single FPGA. If it is not possible to map the module onto
a single FPGA, the tool splits the module, depending on the instantiations and
various constructs in the module, into smaller independent modules, taking care
of all interactions between the modules that arise from splitting the bigger parent
module. This process is carried out recursively for every module. An example
of task graph generation is shown in Figure 2.2, where a module is split
into 16 modules until each can fit onto a single FPGA. The shaded ovals
shown are the modules that cannot fit into a single FPGA and need to be
split, and the leaves of the graph give the final modules that can fit onto a single
FPGA.

Fig. 2.1. Three Phase Approach      Fig. 2.2. Task graph

After obtaining the modules that can fit onto a single FPGA, the second
phase of the tool employs a genetic approach to agglomerate the modules
efficiently, satisfying the constraints, so that they fit into the minimum number
of FPGAs. The third phase performs the mapping onto the FPGAs.



3 Spatial Partitioning 

The target architecture onto which the application is to be mapped is assumed to 
be a network of N FPGAs, each possessing a local memory, and all of them having 
access to a global shared memory [3]. Throughout this paper, let N refer to the 
total number of FPGAs available on the target architecture. During the process 
of spatial partitioning, we are required to assign a particular FPGA from the set 
of N FPGAs to every task that is a part of a particular temporal segment. Apart 
from that, the spatial partitioning phase also addresses the problem of mapping 
logical memory segments to local or shared memory. The complexity of the problem
increases due to various constraints associated with spatial partitioning [1].

— Area Constraint: The logic of the different tasks mapped onto the N
FPGAs dictates the amount of area required on each FPGA. As the available
area of each FPGA is fixed, this imposes a constraint on the logic that can
be realized.

— Speed Constraint: The speed at which the implemented circuit works
is of utmost importance. Heuristics are required to find
an implementation with a good speed of execution; otherwise the
implementation is not practically feasible.

— Memory Constraint: The memory segments of the tasks of a spatial partition
should not exceed the memory resources available.

— Connection Constraint: The Multi-FPGA system is required to
establish connections between specified FPGAs within it, and also to the
external world. Hence, the pin resources should be used
judiciously at this stage.



3.1 Phase I: The Task Graph Generation 

The first phase of ASPIRE involves splitting a given application, specified in a 
HDL format, into smaller modules that can fit into the FPGAs. 

Definition 1. Given a set of n HDL modules $M = (m_1, m_2, \ldots, m_n)$
and k distinct edges $E = (e_1, e_2, \ldots, e_k)$ between modules such that each $e_i$
is an ordered pair of vertices $\langle m_x, m_y \rangle$, the circuit can be represented as
a graph $G = (M, E)$, which is called a task graph. A graph $G' = (M', E')$ is a
subgraph of $G = (M, E)$ if and only if $M' \subseteq M$ and $E' \subseteq E$.

A given module fits into a single FPGA as long as the number of logic blocks
and the number of pins required by the module do not exceed the
number of logic blocks and pins available in the FPGA. The algorithm for
the construction of the task graph is as follows:

Function TaskGraph (HDL description M)
begin

   If M fits in a single FPGA
   then return
   else
   begin

      Split M into smaller modules (m_1, m_2, ..., m_n)

      Add nodes m_1, m_2, ..., m_n to task graph
      Update the task graph to capture the relationships

      For each m_i call TaskGraph(m_i)
   end

End Function

The task array generated by this phase is given as an input to the genetic 
algorithm, which computes an optimal mapping of the tasks onto the FPGAs 
considering all the constraints imposed by the architecture. 
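
The recursive construction above can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than ASPIRE's implementation: the `Module` class, its `logic_blocks`, `pins` and `children` fields, and the helper names are hypothetical, and the way a module is split is assumed to be given by its child instantiations. Only parent/child relationships are recorded as edges.

```python
from dataclasses import dataclass, field


@dataclass
class Module:
    """Hypothetical stand-in for an HDL module and its instantiated sub-modules."""
    name: str
    logic_blocks: int
    pins: int
    children: list = field(default_factory=list)


def fits(m, max_blocks, max_pins):
    # A module fits if neither its logic blocks nor its pins exceed the FPGA's.
    return m.logic_blocks <= max_blocks and m.pins <= max_pins


def task_graph(m, max_blocks, max_pins, nodes=None, edges=None):
    """Recursively split m until every leaf fits on a single FPGA."""
    nodes = [m.name] if nodes is None else nodes
    edges = [] if edges is None else edges
    if fits(m, max_blocks, max_pins):
        return nodes, edges
    if not m.children:
        raise ValueError(f"{m.name} does not fit and cannot be split further")
    for child in m.children:
        nodes.append(child.name)
        edges.append((m.name, child.name))  # assumption: parent/child edges only
        task_graph(child, max_blocks, max_pins, nodes, edges)
    return nodes, edges


def leaves(nodes, edges):
    """Leaves of the graph are the modules that are finally mapped."""
    parents = {p for p, _ in edges}
    return [n for n in nodes if n not in parents]
```

Calling `task_graph(top, 1728, 147)` with the per-FPGA limits quoted in Section 4 returns the node and edge lists; `leaves` then yields the tasks handed to the genetic phase.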







3.2 Phase II: The Genetic Approach 

The second phase of the tool ASPIRE takes its input from the first phase. The
input is in the form of a task graph. This phase provides an optimal mapping of
the tasks onto the given FPGA architecture using an evolutionary strategy.

ENCODING. The encoding used is identical to the one utilized in [1]. Two
arrays are maintained: a task array Task and a memory array Memory. The
length of Task is equal to the number of tasks in the task graph, t, while the
length of Memory is equal to the number of memory segments, m. For $1 \le i \le t$,
the variable Task[i], ranging from 1 to N, represents the number of the FPGA to
which task i is assigned. Similarly, for $1 \le i \le m$, the variable Memory[i]
represents the memory binding of segment i: a value from 1 to N denotes the
local memory of that FPGA, while Memory[i] = 0 implies that
memory segment i is mapped to the shared memory. The Task and Memory
arrays together constitute a chromosome.

INITIAL POPULATION. The task arrays of all chromosomes in the initial
population are set to random legal values. Then, based on the task assignments,
we map the logical memory segments of each chromosome to local or global physical
memories. If the majority of the tasks that access a given logical memory
segment M are assigned to an FPGA f, then M is mapped onto the local memory
present in f. After extensive experimentation, we decided to start with an initial
population of size N.
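
The encoding and the majority rule for memory binding can be illustrated with the short Python sketch below. It is an illustration under assumptions, not the authors' code: the function names and the `accesses` structure (for each memory segment, the list of tasks that access it) are hypothetical, task indices are 0-based in the code, and a segment with no strict majority is assumed to go to shared memory (value 0), which the text does not state explicitly.

```python
import random
from collections import Counter


def random_task_array(num_tasks, n_fpgas, rng):
    """Task[i] in 1..N is the FPGA that task i is assigned to."""
    return [rng.randint(1, n_fpgas) for _ in range(num_tasks)]


def bind_memory(task, accesses):
    """Memory[j] = FPGA holding the majority of tasks that access segment j,
    or 0 (shared memory) when no FPGA holds a strict majority."""
    memory = []
    for users in accesses:      # users: indices of the tasks accessing segment j
        if not users:
            memory.append(0)    # assumption: unused segments go to shared memory
            continue
        fpga, count = Counter(task[i] for i in users).most_common(1)[0]
        memory.append(fpga if 2 * count > len(users) else 0)
    return memory


def initial_population(size, num_tasks, n_fpgas, accesses, seed=0):
    """Random Task arrays, each paired with its derived Memory array."""
    rng = random.Random(seed)
    population = []
    for _ in range(size):
        task = random_task_array(num_tasks, n_fpgas, rng)
        population.append((task, bind_memory(task, accesses)))
    return population
```

Each chromosome is the pair `(Task, Memory)`; deriving Memory from Task keeps the initial bindings consistent with the task placement.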

MATING. The following genetic operators are used in the mating phase of
the proposed genetic algorithm. We present a new approach to the Area and
Memory Constraint operators. We introduce the Speed Constraint Operator and the
Routing Crossover Operator, which are not considered by previous approaches.

— Area Constraint Operator (ACO): This operator takes in two parents,
Parent1 and Parent2, and produces two offspring that are, with
a high probability, better off as far as area constraints are concerned. Note
that ACO acts only on the Task arrays of the parents and not on the Memory
arrays. The ACO has been framed in a manner that retains all the information
regarding the tasks up to the first violation of the area constraint.
The parents are then crossed over in an attempt to reorganize the tasks in a
manner that progressively reduces the number of area conflicts as the
population evolves. Crossover at point i means that the child gets the mapping
of tasks 1 to i from the first parent and of tasks i to the end from the second
parent.

Function ACO (Parent P1, Parent P2)

• Obtain the index, i1, of the first task in the Task array
of P1 that causes an area constraint violation.

• If such a conflict does not exist, then randomly choose
an index.

• Obtain the index, i2, of the first task in the Task array
of P2 that causes an area constraint violation.

• If such a conflict does not exist, then randomly choose
an index.

• Obtain Child1 by crossing over the two parents at i1.

• Obtain Child2 by crossing over the two parents at i2.

End Function
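
A possible reading of the area constraint operator is sketched below in Python. Several details are assumptions made for illustration: every FPGA is given the same `capacity`, `area[i]` is the area of task i, indices are 0-based, and the second child is built with P2 as the leading parent (the text does not say which parent leads).

```python
import random


def first_area_violation(task, area, capacity):
    """Index of the first task whose placement overflows its FPGA, else None."""
    used = {}
    for i, fpga in enumerate(task):
        used[fpga] = used.get(fpga, 0) + area[i]
        if used[fpga] > capacity:
            return i
    return None


def crossover(first, second, point):
    """Child takes genes before `point` from `first` and the rest from `second`."""
    return first[:point] + second[point:]


def area_constraint_operator(p1, p2, area, capacity, rng=random):
    i1 = first_area_violation(p1, area, capacity)
    i2 = first_area_violation(p2, area, capacity)
    if i1 is None:                      # no conflict: fall back to a random index
        i1 = rng.randrange(len(p1))
    if i2 is None:
        i2 = rng.randrange(len(p2))
    child1 = crossover(p1, p2, i1)
    child2 = crossover(p2, p1, i2)      # assumption: P2 leads for the second child
    return child1, child2
```

Cutting at the first violating task keeps the conflict-free prefix of the leading parent and lets the other parent supply the remainder.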

— Pin Constraint Operator (PCO): The PCO attempts to reduce the
delay caused by the pin connections and hence the communication
time of the circuit when it is up and running. The reasoning behind the
PCO is to minimize the inter-FPGA communication. The time required for
a signal to propagate from one FPGA to another is much greater than the
time taken for intra-FPGA information transmission. Hence, in the PCO,
the task that uses the greatest number of pins is chosen, since that task
might have been placed in the wrong FPGA. By moving the task to another
FPGA, one might be able to drastically cut down on the number of pins
used, and hence on the total time required for the signals to propagate.

Function PCO (Parent P1, Parent P2)

• Obtain the index, i1, of the task that utilizes the maximum
number of pins in P1. Break ties arbitrarily.

• Obtain the index, i2, of the task that utilizes the maximum
number of pins in P2. Break ties arbitrarily.

• Obtain Child1 by crossing over the two parents at i1.

• Obtain Child2 by crossing over the two parents at i2.

End Function
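
The pin constraint operator can be sketched in the same style. The pin model used here is an assumption made only for illustration: a task is charged one pin for every task-graph edge whose other endpoint lies on a different FPGA. The function names are hypothetical, and ties are broken by taking the lowest index.

```python
def pins_used(task, edges, i):
    """Inter-FPGA connections of task i: edges whose other end is on another FPGA."""
    return sum(1 for a, b in edges if i in (a, b) and task[a] != task[b])


def max_pin_task(task, edges):
    # max() returns the first maximal index, which is one way to break ties.
    return max(range(len(task)), key=lambda i: pins_used(task, edges, i))


def pin_constraint_operator(p1, p2, edges):
    i1 = max_pin_task(p1, edges)
    i2 = max_pin_task(p2, edges)
    child1 = p1[:i1] + p2[i1:]   # the pin-heavy task takes the other parent's FPGA
    child2 = p2[:i2] + p1[i2:]   # assumption: P2 leads for the second child
    return child1, child2
```

Because the cut falls exactly at the pin-heavy task, that task and everything after it inherit the placement of the other parent.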

— Memory Constraint Operator (MCO): MCO requires two chromosomes,
Parent1 and Parent2, and produces two offspring that are hopefully
better off as far as memory constraints are concerned. Since the amount of
memory available is finite, one has to ensure the judicious allocation
of the memory resources. MCO initially checks for memory conflicts,
and if none exist, it then checks for the memory segment that is most
accessed by tasks in other FPGAs. Such a segment, if placed in another
FPGA, would reduce the number of accesses to it from other FPGAs, as a
consequence of which the circuit would execute faster.
The function MCO is similar to ACO, with the Task
array replaced by the Memory array and area violation by memory violation.

— Default Crossover Operator (DCO): Most chromosomes in the population
would already be valid solutions, but they may not be close to the
optimum; hence we utilize the DCO in an attempt to bring the solution as
close as possible to the optimal solution. The DCO chooses random locations
in the Task and Memory arrays and crosses over the two parents to obtain the
required offspring.

Function DCO (Parent P1, Parent P2)

• Choose a random index, i1, in the Task array.

• Choose a random index, i2, in the Memory array.

• Crossover P1 and P2 at i1 in the Task array and at i2 in
the Memory array.

End Function



For the process of mating, the best chromosomes of the population are chosen
along with a few random selections. The chosen chromosomes are paired up, and
one of the above four operators is applied at random. Note that applying ACO to a
solution that does not violate the area constraint would not be possible; in such
cases the operator simply returns and the DCO is used.



MUTATION. The mutation operator is an attempt to recover characteristics that
have been lost over several generations. This operator randomly chooses a value
from the Task and the Memory arrays and changes it to another legal value. A
probability of mutation is associated with each chromosome in the population,
which is inversely proportional to its fitness value. We choose N_mutation
chromosomes during every iteration for the mutation process.



FITNESS FUNCTION. The fitness function has been divided into five distinct
sections. For each section we obtain a fitness value in the range [0..10],
where 0 indicates a bad solution and 10 indicates an excellent solution. We
weigh the five fitness values and scale the result to a value in the
range [0..10]. The five sections are as follows:

1. Area Fitness Value (AFV): Obtain the number of FPGAs, C, in which
the area required by the logic exceeds the area provided by the FPGA. AFV
is determined by computing $((N - C)/N) \times 10$. The higher the value, the
better the chromosome is with respect to the area constraint.

2. Pin Connection Fitness Value (PFV): Obtain the total number of FPGAs,
E, in which the number of pins available is less than the total number of
pins required by the logic and memory mapped onto it. PFV is determined
by computing $((N - E)/N) \times 10$.

3. Memory Fitness Value (MFV): Obtain the number of local memory and
shared memory resource violations, V. MFV is determined by computing
$((N - V)/N) \times 10$. The equation has been framed in a manner such that a high
value indicates a better solution with respect to the memory constraints.

4. Speed Fitness Value (SFV): The speed at which the implemented
circuit operates depends greatly on the location of a task t with respect
to the variables it interacts with. The speed would be greatly enhanced if
most of the variables used by t were mapped onto the local memory of the
FPGA onto which t is mapped. The fitness value is a function of the proximity
between the FPGA to which a task t is mapped and those storing the variables that
interact with t. The closer the variables are to t, the higher the fitness value.
The critical interacting paths between the tasks should be short for faster
execution, so such tasks should be mapped close to each other; this takes care of
critical paths. This fitness function depends on the architecture of the
multi-FPGA system.

5. Routing Fitness Value (RFV): The routing fitness value tries to minimize
various routing factors, such as the total length of the wires used for
routing and the total number of segments used for routing. It ensures that
the spatial partition is such that it is possible to find a routing on the target
architecture.

The final fitness value is determined by weighing the above five fitness values in
the required proportions and scaling the final value to the range [0..10]. This
way of computing the fitness value takes all the associated factors into
account, with the most important ones given higher priority. It can also
be easily extended to include other constraints.
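
The three violation-count terms and the final weighting can be written compactly. The sketch below is illustrative only: the weights are hypothetical (the text does not give the proportions), and the weighted mean is one simple way of keeping the combined value in [0..10].

```python
def constraint_fitness(n_fpgas, violations):
    """((N - violations) / N) * 10, the form used for AFV, PFV and MFV."""
    return (n_fpgas - violations) / n_fpgas * 10


def overall_fitness(values, weights):
    """Weighted mean of the section values; stays in [0, 10] if each value does."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)
```

For example, with N = 9 FPGAs and three FPGAs violating the area constraint, `constraint_fitness(9, 3)` gives roughly 6.67; a violation-free chromosome scores 10 on that section.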



3.3 The Final Genetic Algorithm 

The sequential genetic algorithm, which provides a solution to the spatial
partitioning problem, is provided below.

SpatialGeneticSequential (Problem Specification) 

— Create initial population; 

— while (solution with the required fitness value has not been 
found) do 

• Choose Parents for mating process 

• Pair up Parents Randomly and Carry Out Mating

• Add Offspring to population 

• Carry out Mutation 

— return (the chromosome C with maximum fitness value) 

End Function 

The third phase of the tool ASPIRE is a simple mapping tool which requires an 
iterative algorithm and is hence not explicitly discussed here. 
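
The overall loop of Section 3.3 can be sketched generically in Python. The code is a hedged illustration, not ASPIRE itself: `fitness`, `mate` and `mutate` are caller-supplied hypothetical callables, chromosomes are plain lists, the population is truncated back to its initial size after mating (the text only says that offspring are added), and the best chromosome is protected from mutation.

```python
import random


def spatial_genetic_sequential(init_pop, fitness, mate, mutate,
                               target, max_iters=1000, n_parents=4, seed=0):
    rng = random.Random(seed)
    population = [list(c) for c in init_pop]
    size = len(population)
    for _ in range(max_iters):
        population.sort(key=fitness, reverse=True)
        if fitness(population[0]) >= target:
            break
        # Best chromosomes plus a few random selections, paired up at random.
        elite = population[: n_parents // 2]
        parents = elite + rng.sample(population, n_parents - len(elite))
        rng.shuffle(parents)
        for a, b in zip(parents[0::2], parents[1::2]):
            population.extend(mate(a, b, rng))
        # Assumption: keep the population size bounded by dropping the weakest.
        population.sort(key=fitness, reverse=True)
        del population[size:]
        victim = rng.randrange(1, size)   # index 0 (the best) is never mutated
        population[victim] = mutate(population[victim], rng)
    return max(population, key=fitness)
```

Since the best chromosome survives both truncation and mutation, the fitness of the returned solution is never lower than that of the best initial chromosome.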



4 Experimental Results 

We have tested ASPIRE on many large circuits and obtained very good results.
An implementation of the modular exponentiation architecture proposed by
Thomas Blum and C. Paar [8] was chosen to illustrate the efficiency of the spatial
partitioning algorithm proposed in this paper. The application is widely
used in the area of cryptography. It is a resource-efficient architecture
suitable for implementation in FPGAs. The design of this architecture is based
on a systolic array, which computes the modular exponentiation. The total number
of logic blocks required by the circuit was 14365, certainly more than the
maximum present in a single ACEX 1K FPGA. Given the Verilog description
of the architecture, ASPIRE generated a task graph of 151 nodes and mapped
it onto a network of nine Altera ACEX 1K FPGAs (part number
EP1K30QC208-1). Each ACEX 1K FPGA has 1728 logic blocks and 147
I/O pins. The proposed genetic approach was implemented on a 1.2 GHz
dual-processor Intel Xeon server.

Table 4.1. Comparison of results with existing GA [3]

S.No   Number of tasks   Run-time (secs),          Run-time (secs),
                         normal GA as in [3]       our algorithm (ASPIRE)
1      5                 2.46                      1.1
2      8                 14.48                     10.1
3      10                46.2                      12.46
4      18                102.1                     20.3
5      20                153.1                     25.5
6      40                228.4                     130.6
7      60                456.2                     150.8
8      100               761.1                     200.7

Graph 4.1. Initial population versus runtime.

Graph 4.2. Initial population versus deviation from optimal.

In addition, our approach was tested on randomly generated task graphs, and
the results are shown in Table 4.1. For several of the instances in Table 4.1
our program takes about one-third of the time taken by the approach in [3], or
less, even for task graphs with a small number (e.g., 20) of tasks. Our approach
remains faster as the number of tasks increases, implying that it scales with
the number of tasks. An
initial population in the range N to 2N, where N here denotes the number of
tasks in the task graph, gave a good balance between runtime and
fitness value (Graphs 4.1 and 4.2). Similar experiments were conducted to fix
the rates of mating and mutation. A rate of mating in the range N/8 to N/4
and a rate of mutation of N/8 proved to be ideal for our algorithm.
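
As a worked check of the run-time comparison, the ratio between the two run-time columns of Table 4.1 can be computed directly. The snippet below uses only the values reported in the table; the function and variable names are illustrative.

```python
# Run-times in seconds copied from Table 4.1: (tasks, GA of [3], ASPIRE).
TABLE_4_1 = [
    (5, 2.46, 1.1), (8, 14.48, 10.1), (10, 46.2, 12.46), (18, 102.1, 20.3),
    (20, 153.1, 25.5), (40, 228.4, 130.6), (60, 456.2, 150.8), (100, 761.1, 200.7),
]


def speedups(rows):
    """Ratio of the baseline run-time to the ASPIRE run-time for each row."""
    return {tasks: base / ours for tasks, base, ours in rows}


for tasks, ratio in speedups(TABLE_4_1).items():
    print(f"{tasks:>3} tasks: {ratio:.2f}x")
```

The ratios range from about 1.4 (8 tasks) to about 6.0 (20 tasks), and ASPIRE is faster in every row of the table.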



5 Conclusion 

In this paper we have presented a completely automatic spatial partitioner, which
takes an HDL description of a large application and maps it onto a given
Multi-FPGA architecture. The crux of our approach is a spatial partitioner that
uses an evolutionary approach to solve the problem. ASPIRE was successfully
employed to spatially partition a reasonably large cryptographic application that
involved a 1024-bit modular exponentiation. The genetic spatial partitioner was
also tested on random task graphs and shown to work efficiently and
to scale to larger task graphs. Future work in this field may be oriented towards
enhancing the crossover operators and the fitness functions. We are currently
working on parallel simulations of the genetic approach to further reduce the
time complexity of our algorithm.


Abstract. This paper proposes new evolutionary algorithms with exaptive
properties to tackle dynamic problems. Exaptation is a new theory with
two implicit procedures: retention and reuse of old solutions. The retention
of a solution involves some kind of memory, and the reuse of a
solution implies the adaptation of the solution to the new problem. The
first proposed algorithm uses seeding techniques to reuse a solution, and
the second uses memory with seeding techniques to
retain and reuse solutions, respectively. Both algorithms are compared
with a simple genetic algorithm (SGA) and with the SGA with two populations,
where the first population is a memory of solutions and the second
searches for new solutions. The Moving Peaks Benchmark (MPB)
was used to test every algorithm.



1 Introduction 

In recent years the optimization of dynamic problems has become a growing
field of research. Real-world problems are not static: they exist in a dynamic
environment, and it is necessary to modify the current solution when a change is
detected. We need evolutionary algorithms that do not restart at every change;
these algorithms must take advantage of the information in the population to obtain
a valid solution in a short time. A similar solution is expected when there is
a small change in the problem, and a dissimilar solution when there is an
important change in the problem.

Some examples of dynamic problems are job shop problems, where there are
changes in due times, in the number of machines, in processing times, etc.,
and these changes imply rescheduling the job shop. Learning in dynamic
environments is a desirable quality for mobile robots, for which navigation
represents a simultaneous problem of path planning and movement to the goal
along the path. Finally, consider an electric company where in some periods
there is high demand for energy and in other periods there is low demand. Usually
an optimization algorithm is needed to manage and control the sources efficiently
under this dynamic demand. All of these dynamic problems can be modeled with
variables, an optimization function and a set of constraints, each of which
can change through time [1]. Many authors have suggested extensions
to the simple genetic algorithm to tackle these dynamic problems. Branke has
suggested the following categories to group the proposed algorithms [2,3]:

— Evolutionary algorithms that detect every change in the environment. If
a change is detected, new individuals are injected into the population
to increase diversity.

— Evolutionary algorithms that have an implicit memory. These algorithms
use double or more complex representations (diploid, haploid) [4,5]. At any
given moment just one representation is active.

— Evolutionary algorithms that have an explicit memory to store useful
information from the past, which is recalled when the dynamic problem returns
to a situation similar to one seen in the past [6,7].

— Evolutionary algorithms that avoid convergence at all times. Genetic
algorithms with sharing and random immigrants are examples of this kind of
algorithm.

— Evolutionary algorithms that use multiple subpopulations to search for the
optimum or for a new optimum.

Many authors have suggested dynamic problems for testing algorithms, but some
of them are too simple or too complex to use in this research area. Branke
suggested a problem with a multidimensional landscape consisting of several
peaks. The width, the height and the position of each peak can be altered slightly
or abruptly when a change in the environment occurs [8]. This benchmark will
be used to test all the algorithms.

The following sections present a comparison between two new evolutionary
algorithms inspired by exaptation, the simple genetic algorithm (SGA),
and the SGA with two populations of Branke [8], all optimizing the dynamic
problem presented in the moving peaks benchmark. Section 2 reviews the exaptation
theory. Section 3 proposes two genetic algorithms inspired by exaptation ideas.
Section 4 reports some experiments and the results of the comparisons.
Finally, Section 5 discusses some conclusions and future work.

2 Exaptation 

Gould and Vrba [9,10] proposed the term exaptation, which refers to a trait that
currently provides fitness but originally arose for some other reason. Every entity
(species) tries to survive in a continuously changing environment. The entity
has traits that let it survive in the environment. Some traits are useful because
they provide high fitness, but others do not provide fitness; they are useless or
redundant. Some traits may have evolved in one context of the environment, but
later such a trait may be co-opted for use in a different role. In other words,
exaptation is a change in the function of an old trait to solve a new problem.

Exaptation has three procedures. First, when there is a change in the
environment, a set of possible traits with high fitness is detected. The second
procedure reuses the possibly useful traits with high fitness and adapts them to
the new environment. The third procedure retains a useful trait for future
reference, but the useless traits do not disappear completely; they are stored as
redundant or useless structures.

The initial population of a genetic algorithm is random. The SGA solves a
problem, and it takes several cycles to reach an optimum or an individual with
high fitness for the optimization function. At the end of the run the population
consists of individuals that are very similar to each other. If there is a change
in the environment and some individuals become useless, some of them may still
have useful traits. If the SGA runs again with the changed function and the same
population, the useful traits may arise and allow a new optimum to be reached
quickly. In the SGA it is possible to reuse the last population if the function
does not change too much. If the change is important, then reusing the last
population can be useless. It is necessary to modify the SGA so that it acquires
some features of exaptation and can be used to solve dynamic problems.



3 Genetic Algorithms with Exaptation 

The first algorithm proposed is the SGA with seeding techniques; this algorithm
is inspired by exaptation because it causes the reuse of structures when the
algorithm detects any change in the optimization function. If a change is detected,
then some variations or neighbors of the best solution found are injected into
the current population. The variations or neighbors replace a percentage of the
total population. Some components (genes) of the best individual may be
reusable for the new function. The neighbors of the solution can give the
appropriate solution if the problem changes slightly. The variations (mutations)
of the solution can give a clue if the problem changes abruptly. This algorithm
belongs to the group of genetic algorithms that detect a change and then apply an
injection.

The second algorithm is inspired by exaptation and is implemented from the
point of view of learning by analogy; this learning mechanism has a memory of
useful solutions from the past, a storing procedure for solutions, a search
mechanism and a modification procedure. The second algorithm applies similar
procedures. In the recognition procedure, the evaluation of the best individual
is compared against the best evaluation found in the past. If the
past solution has degraded, then the objective function has changed. The storing
procedure saves the best individual found into the memory and avoids saving
the same individual in the memory twice. It first locates the individual in the
memory that is most similar to the best individual found; if the best individual
found has a better evaluation than that memory individual, then the best
individual is stored in the memory. Otherwise, the memory is left unchanged.
This procedure tries to apply the exaptive property of retention of
solutions. The modification procedure is based on seeding techniques.

The algorithms implemented are the SGA, the SGA with elitism, the SGA
with two populations, the SGA with seeding and the SGA with memory and
seeding. In the SGA, which reuses the last population of the search, the population
is initialized only at the beginning of the run. The algorithm is shown below.
P is the population, and Fe holds the evaluation value of every individual in the
population. I is an auxiliary variable that holds the best individual of the
population in each generation. The evaluation value of the individual I is saved in e.



Simple genetic algorithm

1) P ← Random initialization

2) Fe ← Evaluation(P)

3) Get the best individual I and its evaluation e from Fe

4) P ← Selection(P, Fe)

5) P ← Reproduction(P) (crossover and mutation)

6) If end condition is not satisfied, then go to step (2)

7) End



The simple genetic algorithm with elitism reuses the last population of the search.
This algorithm has the same features as the SGA; the difference is the elitism
procedure included. The algorithm is the following:



Simple genetic algorithm with elitism

1) P ← Random initialization

2) Fe ← Evaluation(P)

3) Get the best individual I and its evaluation e from Fe

4) P ← Selection(P, Fe)

5) P ← Reproduction(P) (crossover and mutation)

6) Apply elitism by injecting I into P

7) If end condition is not satisfied, then go to step (2)

8) End
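
The two listings above can be sketched in Python as a single function with an `elitism` switch. This is a minimal sketch under assumptions rather than the author's implementation: the default parameters are the ones reported in Section 4, binary tournament selection and one-point crossover are assumed, and the end condition is taken to be a fixed number of generations.

```python
import random


def sga(fitness, n_bits=40, pop_size=100, generations=50,
        p_cross=0.8, p_mut=0.025, elitism=True, seed=0):
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    best = max(pop, key=fitness)
    for _ in range(generations):
        scores = [fitness(ind) for ind in pop]                     # step 2
        best = pop[max(range(pop_size), key=scores.__getitem__)]   # step 3

        def pick():                                                # step 4
            a, b = rng.randrange(pop_size), rng.randrange(pop_size)
            return pop[a] if scores[a] >= scores[b] else pop[b]

        nxt = []
        while len(nxt) < pop_size:                                 # step 5
            c1, c2 = pick()[:], pick()[:]
            if rng.random() < p_cross:
                cut = rng.randrange(1, n_bits)
                c1, c2 = c1[:cut] + c2[cut:], c2[:cut] + c1[cut:]
            for child in (c1, c2):
                # Flip each bit independently with probability p_mut.
                nxt.append([b ^ (rng.random() < p_mut) for b in child])
        pop = nxt[:pop_size]
        if elitism:
            pop[0] = best[:]                                       # step 6
    return max(pop + [best], key=fitness)
```

With `elitism=True` the best individual of each generation is carried into the next one, so the best evaluation cannot decrease while the fitness function stays fixed.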



The SGA with two populations proposed by Branke includes two populations:
the memory population M and the search population P. Every k generations
the best solution found is saved. The algorithm searches for the minimum distance
between the individual I_g and the individuals stored in the memory, and then the
fitter individual is stored in the memory M. A change in the
objective function is detected by comparing the best evaluation value of
the memory, e_m, and that of the current search population, e, with the past
evaluation value detected, e_a. The algorithm reinitializes the search population P
when a change is detected.

Genetic algorithm based on memory proposed by Branke

1) P ← Random initialization

2) M ← Random initialization or empty memory

3) Get the best past evaluation e_a

4) Fe ← Evaluation(P)

5) Get the best individual I and its evaluation value e
from Fe

6) Fem ← Evaluation(M)

7) Get the best individual I_m and its evaluation value e_m
from Fem

8) Detect the best individual I_g between I and I_m

9) If max(e_m, e) < e_a then P ← random initialization

10) Every k generations the best individual I_g is stored in
the memory M

11) P ← Selection(P, Fe)

12) P ← Reproduction(P) (crossover and mutation)

13) Include elitism by injecting I_g into P

14) If end condition is not satisfied, then go to step (3)

15) End



The genetic algorithm with seeding injects neighbors of the best individual
I when it detects a change in the objective function.



Genetic algorithm with seeding techniques

1) P ← Random initialization

2) Get the best past evaluation e_a

3) Fe ← Evaluation(P)

4) Get the best individual I and its evaluation value e
from Fe

5) If e < e_a then apply seeding of 50% of neighbors
and variations of I in P

6) P ← Selection(P, Fe)

7) P ← Reproduction(P) (crossover and mutation)

8) Include elitism by injecting I into P

9) If end condition is not satisfied, then go to step (2)

10) End
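
The seeding step can be illustrated as follows. The text does not define neighbors and variations precisely, so the definitions used here are assumptions: a neighbor flips a single bit of the best individual, and a variation flips each bit with a hypothetical probability `rate`. Half of the population is replaced, matching the 50% figure, and the slots to replace are chosen at random.

```python
import random


def neighbor(ind, rng):
    """Flip one bit: a small step away from the best solution."""
    out = ind[:]
    out[rng.randrange(len(out))] ^= 1
    return out


def variation(ind, rng, rate=0.25):
    """Flip each bit with probability `rate`: a larger, mutation-like change."""
    return [b ^ (rng.random() < rate) for b in ind]


def seed_population(pop, best, rng, fraction=0.5):
    """Return a copy of pop with `fraction` of it replaced by seeds of `best`."""
    pop = [ind[:] for ind in pop]
    n_seed = int(len(pop) * fraction)
    for k, slot in enumerate(rng.sample(range(len(pop)), n_seed)):
        maker = neighbor if k % 2 == 0 else variation   # alternate the two kinds
        pop[slot] = maker(best, rng)
    return pop


def changed(e, past_best):
    """Change test: the best evaluation dropped below the past best value."""
    return e < past_best
```

In the main loop, `seed_population` would be called only when `changed(e, past_best)` is true, before selection and reproduction.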



The SGA with memory and seeding has a memory to save the best solution
found. The procedure is similar to the algorithm of Branke. The algorithm uses
an SGA with seeding of 50% of neighbors and variations. Seeding is applied
when the evaluation value of the memory, e_m, is lower than the best
past evaluation value, e_a. When a lower evaluation is detected in the
individuals of the memory, this indicates that the function has changed and
the solution is unknown, so it is necessary to find a new one using only the
information in the memory.

Genetic algorithm with memory and seeding techniques

1) P ← Random initialization

2) M ← Random initialization or empty memory

3) Get the best past evaluation value e_a

4) Fem ← Evaluation(M)

5) Get the best individual I_m and its evaluation value e_m
from Fem

6) If e_m < e_a then apply seeding of 50% of
neighbors and variations of I in P

7) Fe ← Evaluation(P)

8) Get the best individual I and its evaluation value e
from Fe

9) P ← Selection(P, Fe)

10) P ← Reproduction(P) (crossover and mutation)

11) Every k generations save the best individual I in memory M

12) If end condition is not satisfied, then go to step (3)

13) End
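
The storing procedure used in step 11 can be sketched as below. This is one possible reading, with several assumptions stated openly: similarity is measured by Hamming distance, the memory has a fixed hypothetical `capacity`, and free slots are filled before any replacement takes place (the text only describes the replacement rule).

```python
def hamming(a, b):
    """Number of positions at which two equal-length bit lists differ."""
    return sum(x != y for x, y in zip(a, b))


def store(memory, best, fitness, capacity):
    """Return the memory after trying to insert `best`."""
    if best in memory:
        return memory                   # never store the same individual twice
    if len(memory) < capacity:
        return memory + [best[:]]       # assumption: fill free slots first
    nearest = min(range(len(memory)), key=lambda i: hamming(memory[i], best))
    if fitness(best) > fitness(memory[nearest]):
        # Replace the most similar stored individual when the newcomer is better.
        return memory[:nearest] + [best[:]] + memory[nearest + 1:]
    return memory
```

Replacing only the most similar entry keeps the memory diverse: a new solution competes with the stored solution nearest to it rather than with the weakest one overall.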



4 Experimentation and Results 



The experiment uses a SGA with a crossover probability of 0.8, a mutation rate 
of 0.025, and a total population size of 100 individuals. The benchmark has 
five variables. Every variable is coded as a binary vector of size 8, so the 
chromosome has 40 bits. Tournament selection of size 2 is used. Every generation 
involves 100 evaluations. For a dynamic fitness function it is useless to report 
the best solution found. Instead, the offline performance is reported, which is 
the average of the best solutions at each time step T, 
$e_p(T) = \frac{1}{T} \sum_{t=1}^{T} e(t)$, where e(t) is the best solution per 
generation at time t. The number of values used for the average grows with time, 
so the curves tend to become smoother. For the comparison, the five algorithms 
of the last section are implemented. The memory of the algorithms is initially 
empty. 
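
As a small illustration of the offline-performance measure described above, the following sketch computes the running average of the best value per generation. The function name is hypothetical and the input is assumed to be a plain list of per-generation best values.

```python
def offline_performance(best_per_generation):
    """Return e_p(T) for T = 1..N, the running mean of the best value per generation."""
    curve = []
    running_sum = 0.0
    for t, e_t in enumerate(best_per_generation, start=1):
        running_sum += e_t
        curve.append(running_sum / t)
    return curve
```

Because each point averages over all earlier generations, a sudden drop after a change of the fitness function shows up as a gradual decline of the curve rather than a step.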

The fitness function changes every 50 generations. There are two test cases. In 
the first one, the location of every peak stays the same, but the height and 
width of every peak change (set s = 0, λ = 1 in MPB). Figure 1 shows the offline 
performance of four algorithms: the SGA (sga), the SGA with two populations 
(sga2p), the SGA with memory and seeding (sgams), and the SGA with seeding 
(sgas). The SGA with elitism (sgae) is not shown because it has the same 
performance as the SGA. The results suggest only a minimal difference between 
the algorithm with two populations and the SGA with memory and seeding, so 
neither clearly outperforms the other. Nevertheless, both algorithms are better 
than the SGA and the SGA with seeding. 

Fig. 1. Offline performance of several approaches (horizontal axis: generations). 
The peak locations stay in the same place. 

Fig. 2. Offline performance of several approaches. The peak locations change 
position. 

In the second case the peak locations change (set s = 0.5, λ = 1 in MPB). 
Figure 2 shows the offline performance of four algorithms; the SGA is not shown 
because its performance is similar to that of the SGA with elitism. The results 
suggest that the SGA with memory and seeding performs better than the SGA with 
two populations. 

5 Conclusion 

In this paper I propose two algorithms inspired by exaptation ideas. The 
algorithms make intensive use of seeding techniques based on the best solution 
available at the moment. Both algorithms were tested on the Moving Peaks 
Benchmark suggested by Branke. Other algorithms were implemented for comparison. 
The proposed algorithms were shown to be competitive in solving dynamic 
problems, because their performance is similar to that of the compared 
algorithms. 

Future work has two directions. First, we want to obtain purely exaptive 
algorithms with implicit retention and reuse properties and with more controlled 
seeding techniques. Second, we want to find more dynamic problems on which to 
test these algorithms, such as the problems found in dynamic learning. 

Abstract. Artificial intelligence techniques have been developed through 
extensive practical implementations in industry in the form of intelligent 
control. One of the most successful expert-system techniques, applied to a wide 
range of control applications, has been the fuzzy logic technique. This paper 
shows the implementation of a fuzzy logic controller (FLC) to regulate the steam 
temperature in a 300 MW thermal power plant. The proposed FLC was applied to 
regulate superheated and reheated steam temperature. The results show that the 
fuzzy controller performs better than an advanced model-based controller, such 
as Dynamic Matrix Control (DMC), or a conventional PID controller. The main 
benefits are the reduction of the overshoot and the tighter regulation of the 
steam temperatures. Fuzzy logic controllers can achieve good results for complex 
nonlinear processes with dynamic variation or with long delay times. 



1 Introduction 

Over the last 15 years the complexity of the operation of thermal power plants 
has increased significantly, mainly because of two factors: changes in the 
operating conditions and the aging of the plants. Today, the operation of 
thermal power plants must be optimal with respect to higher production profits, 
safer operation and stringent environmental regulation. In addition, the 
reliability and performance of the plants are affected by their age. These 
factors increase the risk of equipment failures and the number of diagnoses and 
control decisions that the human operator must make [1, 2]. 

As a result of these changes, computer and information technology have been 
extensively used in thermal plant process operation. Distributed control systems 
(DCS) and management information systems (MIS) have played an important role in 
showing the plant status. The main function of a DCS is to handle normal 
disturbances and maintain key process parameters at pre-specified local optimal 
levels. Despite their great success, DCS are of little use for abnormal and 
non-routine operation because they rely widely on classical 
proportional-integral-derivative (PID) control. PID controllers exhibit poor 
performance when applied to processes containing unknown nonlinearities and time 
delays. The complexity of these problems and the difficulties in implementing 
conventional controllers to eliminate variations in PID tuning motivate the use 
of other kinds of controllers, such as model-based controllers and intelligent 
controllers. 

This paper proposes a model-based controller, the Dynamic Matrix Controller, and 
an intelligent controller based on fuzzy logic as alternative control strategies 
to regulate the steam temperature of the thermal power plant. Temperature 
regulation is considered the most demanding control loop in the steam generation 
process. The steam temperature deviation must be kept within a tight range in 
order to ensure safe operation, improve efficiency and increase the life span of 
the equipment. Moreover, there are many mutual interactions between steam 
temperature control loops that must be considered. Another important factor is 
the time delay. It is well known that the time delay makes the temperature loops 
hard to tune. The complexity of these problems and the difficulty of 
implementing conventional PID controllers motivate research into model 
predictive controllers such as the dynamic matrix controller, or intelligent 
control techniques such as the fuzzy logic controller, as a solution for 
controlling systems in which time delays and nonlinear behavior need to be 
addressed [3,4]. 

The paper is organized as follows. A brief description of the Dynamic Matrix 
Controller (DMC) is presented in Section 2. The fuzzy logic controller (FLC) design 
is described in Section 3. Section 4 presents the implementation of both controllers 
DMC and FLC to regulate the superheated and reheated steam temperature of a 
thermal power plant. The performance of the FLC controller was evaluated against 
two other controllers, the conventional PID controller and the predictive DMC 
controller. Results are presented in Section 5. Finally, the main set of conclusions 
according to the analysis and results derived from the performance of controllers is 
presented in Section 6. 



2 Dynamic Matrix Control 

Dynamic Matrix Control (DMC) is a kind of model-based predictive control. This 
controller was developed to improve the control of oil refining processes [5]. 
DMC and other predictive control techniques, such as Generalized Predictive 
Control [6] or the Smith predictor [6], use past and present information about 
the controlled and manipulated variables to predict the future state of the 
process. 

Dynamic Matrix Control is based on a time-domain model. This model is used to 
predict the future behavior of the process over a defined time horizon. On this 
basis, the control algorithm provides a way to define the process behavior over 
time, predicting the trajectory of the controlled variables as a function of 
previous control actions and current values of the process [7]. 

The control technique includes the following procedures: 

a) Obtaining the dynamic matrix model of the process. In this stage, a step 
signal is applied to the input of the process. The measurements obtained 
represent the process behavior as well as the coefficients of the process state 
in time. This step is performed just once, before the control algorithm is put 
into operation in the process. 

b) Determination of deviations in controlled variables. In this step, the 
deviation between the controlled variables of the process and their respective 
set points is measured. 

c) Projection of future states of the process. The future behavior of each 
controlled variable is defined in a vector. This vector is based on previous 
control actions and current values of the process. 

d) Calculation of control movements. Control movements are obtained using the 
future vector of error and the dynamic matrix of the process. The equation 
developed to obtain the control movements is shown below: 

$$ \Delta u = [A^{T} A + f^{2} I]^{-1} A^{T} X \tag{1} $$ 

where $A$ represents the dynamic matrix, $A^{T}$ the transpose of $A$, $X$ the 
vector of future states of the process, $f$ a weighting factor, $I$ the identity 
matrix and $\Delta u$ the future control actions. Further details about this 
equation are found in [5]. 

e) Implementation of control movements. In this step the first element of the 
vector of control movements is applied to the manipulated variables. 
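
Steps a) to e) can be illustrated for a single input and a single output with the sketch below. It is a minimal example under stated assumptions, not the controller used in the paper: the dynamic matrix is assumed to be the usual lower-triangular matrix of step-response coefficients, the weighting factor is assumed to enter as f squared on the diagonal, and the vector X is taken to be the projected error vector. All function names are hypothetical.

```python
def dynamic_matrix(step_response, horizon, n_moves):
    """Step a): lower-triangular matrix built from step-response coefficients."""
    return [[step_response[i - j] if i >= j else 0.0 for j in range(n_moves)]
            for i in range(horizon)]

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(a)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[p] = aug[p], aug[c]
        for r in range(c + 1, n):
            factor = aug[r][c] / aug[c][c]
            for j in range(c, n + 1):
                aug[r][j] -= factor * aug[c][j]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - s) / aug[i][i]
    return x

def dmc_moves(step_response, errors, n_moves, f):
    """Step d): solve (A^T A + f^2 I) du = A^T X for the control moves du."""
    horizon = len(errors)
    a = dynamic_matrix(step_response, horizon, n_moves)
    ata = [[sum(a[k][i] * a[k][j] for k in range(horizon)) for j in range(n_moves)]
           for i in range(n_moves)]
    for i in range(n_moves):
        ata[i][i] += f * f          # assumed form of the weighting term
    atx = [sum(a[k][i] * errors[k] for k in range(horizon)) for i in range(n_moves)]
    return solve(ata, atx)

def first_move(step_response, errors, n_moves, f):
    """Step e): only the first element of the move vector is applied."""
    return dmc_moves(step_response, errors, n_moves, f)[0]
```

A larger weighting factor f shrinks the computed moves, which trades speed of correction for smoother actuator action.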

A DMC controller allows designers to use time-domain information to create a 
process model. The mathematical method for prediction matches the predicted 
behavior against the actual behavior of the process to predict the next state of 
the process. However, the process model is not continuously updated, because 
this would involve recalculations that can overload the processors and degrade 
performance. 

Discrepancies between the real behavior of the process and the predicted state 
are considered only in the current calculation of control movements. Thus, the 
controller is adjusted continuously based on deviations between the predicted 
and real behavior, while the model remains static. 



3 Fuzzy Logic Control 

Fuzzy control is used when the process follows some general operating 
characteristic and a detailed understanding of the process is not available or 
the process model becomes overly complex. The capability to qualitatively 
capture the attributes of a control system based on observable phenomena and the 
capability to model the nonlinearities of the process are the main features of 
fuzzy control. The ability of fuzzy logic to capture system dynamics 
qualitatively and execute this qualitative schema in a real-time situation is an 
attractive feature for temperature control systems [8]. 

The essential part of the fuzzy logic controller is a set of linguistic control 
rules related to the dual concepts of fuzzy implication and the compositional 
rule of inference [9]. 

Essentially, the fuzzy controller provides an algorithm that can convert the 
linguistic control strategy, based on expert knowledge, into an automatic 
control strategy. In general, the basic configuration of a fuzzy controller has 
four main modules, as shown in Figure 1. 

Fig. 1. Architecture of a Fuzzy Logic Controller. 

In the first module, a quantization module converts to discrete values and 
normalizes the universe of discourse of various manipulated variables (Input). Then, a 
numerical fuzzy converter maps crisp data to fuzzy numbers characterized by a fuzzy 
set and a linguistic label (Fuzzification). In the next module, the inference engine 
applies the compositional rule of inference to the rule base in order to derive fuzzy 
values of the control signal from the input facts of the controller. Finally, a symbolic- 
numerical interface known as defuzzification module provides a numerical value of 
the control signal or increment in the control action. 

Thus the necessary steps to build a fuzzy control system are [10,11]: (a) 
representation of the input and output variables in linguistic terms within a 
universe of discourse; (b) definition of the membership functions that convert 
the process input variables to fuzzy sets; (c) configuration of the knowledge 
base; (d) design of the inference unit that relates input data to the fuzzy 
rules of the knowledge base; and (e) design of the module that converts the 
fuzzy control actions into physical control actions. 



4 Implementation 

The steam temperature is controlled by two methods. One of them is to spray 
water into the steam flow, mainly in the superheater. The sprayed water must be 
strictly regulated to prevent the steam temperature from exceeding the design 
temperature range of ±1% (±5 °C). This ensures correct operation of the process, 
improves efficiency and extends the lifetime of the equipment. Excess sprayed 
water can degrade the turbine, because water in the liquid phase strikes the 
turbine blades hard. The other method of controlling the steam temperature is to 
change the burner slope in the furnace, mainly for the reheater. The main 
objective of this manipulation is to keep the steam temperature constant when 
the load changes. 

The DMC, fuzzy logic and PID controllers were implemented in a full-model 
simulator to control the superheated and reheated steam temperature. The 
simulator sequentially simulates the main process and control systems of a 
300 MW fossil power plant. 



4.1 Dynamic Matrix Control (DMC) 

The matrix model of the process is the main component of the Dynamic Matrix 
Controller. In this case the matrix model was obtained by a step signal in both the 
sprayed water flow and the burners' position. 

Figure 2 shows a block diagram of the DMC implementation in the steam 
superheating and reheating sections. The temperature deviations were used as the 
controller's input. The sprayed water flow and slope of burners were used as the 
manipulated variables or controller's output. 




Fig. 2. Implementation of DMC controller 



4.2 Fuzzy Logic Control (FLC) 

Seven fuzzy sets were chosen to define the states of the controlled and 
manipulated variables. The triangular membership functions and their linguistic 
representation are shown in figure 3. The abbreviations of the fuzzy sets are: 
NB=Negative Big, NM=Negative Medium, NS=Negative Short, ZE=Zero, PS=Positive 
Short, PM=Positive Medium and PB=Positive Big. 
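
To make the fuzzification step concrete, the sketch below maps a normalized crisp value to membership degrees in the seven labels. The exact shapes used in the paper are in the figure and cannot be recovered here, so this sketch assumes evenly spaced triangular sets on the normalized range [-1, 1]. The function names are hypothetical.

```python
LABELS = ["NB", "NM", "NS", "ZE", "PS", "PM", "PB"]

def triangle(x, left, peak, right):
    """Triangular membership function with support (left, right) and apex at peak."""
    if x <= left or x >= right:
        return 0.0
    if x <= peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)

def fuzzify(x):
    """Return the membership degree of normalized x in each of the seven labels."""
    x = max(-1.0, min(1.0, x))  # saturate values outside the universe of discourse
    degrees = {}
    for i, label in enumerate(LABELS):
        # Apexes at -1, -2/3, -1/3, 0, 1/3, 2/3, 1 (assumed spacing).
        left, peak, right = (i - 4) / 3.0, (i - 3) / 3.0, (i - 2) / 3.0
        degrees[label] = triangle(x, left, peak, right)
    return degrees
```

With this spacing each crisp value activates at most two neighboring labels, and their degrees add up to one.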

The design of the rule base is a very important part of a fuzzy system and a 
complex activity for control systems. Li et al. [11] proposed a methodology to 
develop the set of rules for a fuzzy controller based on a general model of a 
process rather than on the subjective practical experience of human experts. The 
methodology includes analyzing the general dynamic behavior of a process, which 
can be classified as stable or unstable (figure 4). 

Fig. 3. Fuzzy sets of FLC 

Fig. 4. Common step response of a process (panel shown: unstable response of a 
process) 

In figure 3 the ranges of the fuzzy sets are normalized to regulate the 
temperature within 20% above or below the set point and the change of error 
within ±10%, and the control action is considered to move from completely closed 
(0° of inclination) to completely open (90° of inclination) for the water flow 
valve and the slope of the burners, respectively. If the requirements change so 
that the temperature must be regulated within a greater range, the methodology 
proposed by Li et al. [11] applies a scale factor to the fuzzy sets. 

The characteristics of the four responses are contained in the second stable 
response. The approach also uses an error state space representation to show the 
inclusion of the four responses in the second stable one (figure 5). 

A set of general rules can be built by using the general step response of a 
process (second-order stable system): 

1. If the magnitude of the error and the speed of change are zero, then it is 
not necessary to apply any control action (keep the value of the manipulated 
variable). 

2. If the magnitude of the error is approaching zero at a satisfactory speed, 
then no control action is necessary (keep the value of the manipulated 
variable). 

3. If the magnitude of the error is not close to the system equilibrium point 
(origin of the phase plane diagram), then the value of the manipulated variable 
is modified as a function of the sign and magnitude of the error and the speed 
of change. 

Fig. 5. Time domain response in error state space representation (panel shown: 
unstable response of a process). 

Fig. 6. Step response of a second order system. 

The fuzzy control rules were obtained by observing the transitions in the 
temperature deviations and their change rates, considering a general step 
response of a process instead of the response of the actual process to be 
controlled. The magnitude of the control action depends on the characteristics 
of the actual process to be controlled and is decided during the construction of 
the fuzzy rules. A coarse variable (few labels or fuzzy regions) produces a 
large output or control action, while a fine variable produces a small one. 

Figure 6 shows a representative step response of a second-order system. Based on 
this figure, a set of rules can be generated. For the first reference range (I), 
a fuzzy rule is needed to reduce the rise time of the signal: 

IF E=PB and ΔE=NS THEN OA=PM (2) 

where E is the deviation, ΔE denotes the change rate and OA is the output action 
required to regulate the controlled variable. Another rule can be obtained for 
this same region (I). The objective of this rule is to reduce the overshoot in 
the system response: 

IF E=PS and ΔE=NB THEN OA=NM (3) 

Table 1. Set of Rules of a FLC. 

Rule   E    ΔE   Control Action   Reference Point 
1      PB   ZE   PB               a 
2      PM   ZE   PM               e 
3      PS   ZE   PS               i 
4      ZE   NB   NB               b 
5      ZE   NM   NM               f 
6      ZE   NS   NS               j 
7      NB   ZE   NB               c 
8      NM   ZE   NM               g 
9      NS   ZE   NS               k 
10     ZE   PB   PB               d 
11     ZE   PM   PM               h 
12     ZE   PS   PS               l 
13     ZE   ZE   ZE               set point 

Rule   E    ΔE   Control Action   Reference Range 
14     PB   NS   PM               I (rise time) 
15     PB   NB   NM               I (overshoot) 
16     NS   PS   NM               III 
17     NS   PB   PM               III 
18     PS   NS   ZE               IX 
19     NS   PS   ZE               XI 

By analyzing the step response it is possible to generate the fuzzy rule set for 
each region and point in the graph. Table 1 shows the set of rules obtained 
using this methodology. The second and third columns represent the main 
combinations of the error and its change rate for each variable. The fourth 
column indicates the control action needed for the process condition. The last 
column shows the reference points and ranges that belong to each fuzzy rule. 
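
One way to hold the rule base of Table 1 in a program is a list of tuples that is searched by the labels of the error and of its change rate. The sketch below transcribes the table as printed, including the two rows that share the same antecedent, and makes no attempt to resolve them. The names RULES and control_actions are hypothetical.

```python
# (rule number, E, change of E, control action), transcribed from Table 1 as printed.
RULES = [
    (1, "PB", "ZE", "PB"), (2, "PM", "ZE", "PM"), (3, "PS", "ZE", "PS"),
    (4, "ZE", "NB", "NB"), (5, "ZE", "NM", "NM"), (6, "ZE", "NS", "NS"),
    (7, "NB", "ZE", "NB"), (8, "NM", "ZE", "NM"), (9, "NS", "ZE", "NS"),
    (10, "ZE", "PB", "PB"), (11, "ZE", "PM", "PM"), (12, "ZE", "PS", "PS"),
    (13, "ZE", "ZE", "ZE"),
    (14, "PB", "NS", "PM"), (15, "PB", "NB", "NM"),
    (16, "NS", "PS", "NM"), (17, "NS", "PB", "PM"),
    (18, "PS", "NS", "ZE"), (19, "NS", "PS", "ZE"),
]

def control_actions(e_label, de_label):
    """Return the control-action labels of every rule whose antecedent matches."""
    return [action for _, e, de, action in RULES
            if e == e_label and de == de_label]
```

A lookup that returns an empty list means the table defines no rule for that combination, so a complete controller would need a default action for such cases.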



5 Controllers Performance 

In every case, the system was subjected to an increase in load demand [12]. The 
disturbance was a ramp from 70% to 90% of the load. The load change rate was 
10 MW/min, which represents the maximum speed of change in the load. The 
performance parameters evaluated were overshoot amplitude, response time, 
maximum value, integral of the squared error and integral of the absolute error. 
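
Several of these indices can be computed directly from a sampled response. The sketch below uses trapezoidal integration for the two integral indices and assumes that overshoot is the largest excursion above the set point. Response time is left out because the paper does not state the criterion used for it. The function name and the dictionary keys are hypothetical.

```python
def performance_indices(times, values, set_point):
    """Compute overshoot, maximum value, ISE and IAE from a sampled response."""
    errors = [set_point - v for v in values]
    ise = 0.0
    iae = 0.0
    for i in range(1, len(times)):
        dt = times[i] - times[i - 1]
        # Trapezoidal rule on the squared and absolute error.
        ise += 0.5 * (errors[i] ** 2 + errors[i - 1] ** 2) * dt
        iae += 0.5 * (abs(errors[i]) + abs(errors[i - 1])) * dt
    max_value = max(values)
    return {
        "overshoot": max(0.0, max_value - set_point),
        "max_value": max_value,
        "ise": ise,
        "iae": iae,
    }
```

The squared-error index weights large deviations more heavily, while the absolute-error index treats all deviations in proportion to their size.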

Figure 7 shows the results obtained by each controller in the superheater and 
reheater, respectively. In both cases the stability of the steam temperature is 
greatly improved by the advanced control algorithms. Likewise, the steam 
temperature deviation from the set point is tightly regulated. 

Fig. 7. Superheated and reheated steam temperature response. 

The DMC reduced the superheated steam temperature overshoot by almost 30% and 
the response time by 15% relative to the PID controller. The maximum deviation 
observed with respect to the reference is reduced by 30% relative to the PID 
controller. In the reheater, the DMC reduced the steam temperature deviation by 
almost 65% and the response time by 60% relative to the PID controller. The 
maximum value reached by the steam temperature is also reduced significantly in 
the same comparison (65%). 

When a fuzzy controller is applied to the superheated steam temperature with the 
same disturbance described before, the overshoot is reduced by almost 80% 
relative to the PID controller and by 70% relative to the DMC. There is no 
response time because the fuzzy controller keeps the steam temperature within a 
tight range. The maximum deviation observed with respect to the reference is 
strongly reduced relative to both the PID controller (80%) and the DMC (70%). 

There is no overshoot using the fuzzy controller because of the kind of process 
response. There is also no response time because the fuzzy controller keeps the 
steam temperature within a tight range. 



6 Conclusions 

The Dynamic Matrix Control algorithm has shown a significant reduction of 
overshoot and deviation and better stability in the superheated and reheated 
steam temperatures. However, the use of fuzzy logic to control the steam 
temperatures achieved better performance in the aforementioned characteristics. 
Moreover, the response times of the signals were considered insignificant 
because the fuzzy controller kept the steam temperatures within a tight range. 

The Dynamic Matrix controller executes matrix calculations whose size is a 
function of the number of inputs in the process. The speed of the DMC can be 
affected by this parameter. The fuzzy logic controller is based on solving 
arithmetic and logic equations independently of the number of inputs. 

Both fuzzy and Dynamic Matrix controllers are suitable when there is no model 
that represents the process. The DMC algorithm has the disadvantage of keeping 
the process away from the control point in order to obtain the matrix model. On 
the other hand, the fuzzy controller implementation is based on approximate 
reasoning and knowledge representation of experience. The fuzzy controller 
considers the expert reasoning and knowledge of operators as well as a general 
model or dynamic behavior of a process. The main benefits of fuzzy control are 
the reduction of the overshoot and the tight regulation of steam temperature. 

Abstract. A nonlinear mathematical model of a fed-batch fermentation process of 
Bacillus thuringiensis (Bt.) is derived. The obtained model is validated by 
experimental data. Identification and direct adaptive neural control systems 
with and without an integral term are proposed. The system contains a neural 
identifier and a neural controller, based on the recurrent trainable neural 
network model. The applicability of both the proportional and the integral-term 
direct adaptive neural control schemes is confirmed by comparative simulation 
results, also with respect to λ-tracking control. All of them exhibit good 
convergence, but the I-term control could compensate a constant offset, whereas 
the proportional controls could not. 



1 Introduction 

Recent advances in understanding the working principles of artificial neural 
networks have given a tremendous boost to the application of these modeling 
tools for the control of nonlinear systems, [1], [2], [3], [4]. Most current 
applications rely on the classical NARMA approach, in which a feedforward 
network is used to synthesize the nonlinear map, [2], [4]. This approach is 
powerful in itself but has some disadvantages, [2]: the network inputs are a 
number of past system inputs and outputs, so a trial-and-error procedure must be 
carried out to find the optimum number of past values; the model is naturally 
formulated in discrete time with a fixed sampling period, so if the sampling 
period is changed, the network must be trained again; the problems associated 
with the stability, convergence and rate of convergence of these networks are 
not clearly understood, and there is no framework available for analysis in 
vector-matricial form, [4]; and the plant order must be known. To avoid these 
difficulties, a new recurrent Neural Network (NN) topology and a Backpropagation 
(BP) learning algorithm, [5], derived in vector-matricial form, have been 
proposed, and their convergence has been studied. 

Adaptive control by Neural Networks (NNs) is of interest in biotechnology for 
controlling metabolite production, [6], due to its adaptation to time-varying 
process characteristics. A metabolite is a product of the microbial activity 
during the microorganism metabolism. The microbial cultivation can be carried 
out in batch, fed-batch or continuous fermentation, [7]. In some references, the 
nonlinear mathematical model of this fermentation process is linearized and a 
classical control is designed based on these linear equations, [8]. The major 
disadvantage of this approach is that it is not adaptive and cannot respond to 
changing process characteristics. In [9] a comparative study of linear, 
nonlinear and neural-network-based adaptive controllers for a class of fed-batch 
baker's and brewer's yeast fermentation is done. That paper proposed to use the 
method of neural identification control given in [4], and applied Feedforward 
(FF) NNs (Multilayer Perceptron, MLP, and Radial Basis Function NN, RBF). The 
proposed control gives a good approximation of the nonlinear plant dynamics, 
better than the other control methods, but the applied static NNs have great 
complexity, and the plant order has to be known. The application of Recurrent 
NNs (RNN) could avoid this problem and could significantly reduce the size of 
the applied NNs. For the bioprocess of interest (fermentation of Bacillus 
thuringiensis), RNNs have been applied only for system identification, [10], 
process prediction, [11], and inverse plant model feedforward neural control, 
[12]. In the present paper, it is proposed to apply a direct adaptive recurrent 
neural control with integral term, transforming the control scheme given in 
[12], which seems to be appropriate to capture the nonlinear dynamics given by a 
mathematical model of the fed-batch fermentation of Bacillus thuringiensis 
(Bt.), and to track the system reference in the presence of noise. 



2 Fed-Batch Fermentation Process Description 

List of symbols used. 

u(t) Nutrient feeding rate at time t, h^-1. 

S_f Substrate concentration in the feed, g/l. 

S(t) Substrate concentration in the culture at t, g/l. 

X(t) Biomass concentration in the culture at t, g/l. 

V(t) Culture volume in the fermentor at t, l. 

A sketch of the fed-batch fermentor is shown in Fig. 1. The fed-batch 
fermentation model describes a microorganism cultivation in a sterile reactor, 
[7], [11], maintained under operational conditions adequate for microorganism 
growth at a desired specific growth rate (μ). The operational conditions 
considered are: a temperature of 30°C, a pH of 7.1 and dissolved oxygen at a 
concentration greater than 40%, [11]. The nutrients are supplied during the 
exponential phase of Bt. growth. 

Fig. 1. Fed-batch fermentation process of Bt. 

To derive the model, the following considerations have been taken into account: 
1) the yield coefficient (Y) is constant throughout the fermentation; 2) the 
substrate consumption for cell maintenance is negligible; 3) the increase of 
volume in the fermentor is equal to the nutrient volume fed; 4) cell death is 
considered negligible during the fermentation. The model is based on the mass 
balance equations of the fermentor, [9], which are as follows: 

- Evolution of the culture volume: 

$$ \frac{dV(t)}{dt} = u(t) \tag{1} $$ 

- Evolution of the total microorganism mass in the fermentor: 

$$ \frac{d[X(t)V(t)]}{dt} = \mu(S(t))\,X(t)\,V(t) \tag{2} $$ 

- Evolution of the limiting substrate in the fermentor: 

$$ \frac{d[S(t)V(t)]}{dt} = u(t)\,S_f - \frac{\mu(S(t))\,X(t)\,V(t)}{Y} \tag{3} $$ 

- The current specific growth rate $\mu(S(t))$ is described by the Monod 
equation, [11]: 

$$ \mu(S(t)) = \mu_{max}\,\frac{S(t)}{K_m + S(t)} \tag{4} $$ 

Where: $\mu_{max}$ is the maximal growth rate; $K_m$ is a Michaelis-Menten 
constant. Expanding the derivatives in (2) and (3) and substituting (1) gives: 

$$ \dot{V}(t) = u(t) \tag{5} $$ 

$$ \dot{X}(t) = \left[\mu(S(t)) - \frac{u(t)}{V(t)}\right] X(t) \tag{6} $$ 

$$ \dot{S}(t) = \frac{u(t)}{V(t)}\,[S_f - S(t)] - \frac{\mu(S(t))\,X(t)}{Y} \tag{7} $$ 

Where: V(t) is a growth function representing the culture volume in the 
fermentor at t (t is the current time); u(t) is the input to the fermentor at t; 
$S_f > 0$ is the substrate concentration in u(t) at t; S(t) is the substrate 
concentration in the culture at t; X(t) is the cell concentration in the culture 
at t; finally, $Y > 0$, and $\mu: \Re_{\geq 0} \times \Re_{\geq 0} \to \Re_{\geq 0}$ 
is a continuous function with $\mu(t, 0) = 0 \; \forall t > 0$. The existence of 
a positive solution of equations (5), (6), (7) has been proved by a theorem, 
[13]. 
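
The model can be explored numerically with a simple forward Euler integration of the volume, biomass and substrate equations, using the Monod law for the specific growth rate. The sketch below is illustrative only: the step size and every parameter value passed by a caller are assumptions rather than values from the paper, the name k_m for the Monod constant is a notational choice, and the function names are hypothetical.

```python
def monod(s, mu_max, k_m):
    """Specific growth rate as a function of substrate concentration (Monod law)."""
    return mu_max * s / (k_m + s)

def simulate_fed_batch(u, x0, s0, v0, s_f, y, mu_max, k_m, dt, steps):
    """Integrate the fed-batch model with forward Euler.

    u is a function of time returning the feeding rate.
    Returns a list of (t, X, S, V) tuples.
    """
    x, s, v = x0, s0, v0
    trajectory = [(0.0, x, s, v)]
    for k in range(steps):
        feed = u(k * dt)
        mu = monod(s, mu_max, k_m)
        dx = (mu - feed / v) * x                        # biomass balance
        ds = feed / v * (s_f - s) - mu * x / y          # substrate balance
        dv = feed                                       # volume balance
        x = x + dt * dx
        s = max(0.0, s + dt * ds)                       # substrate cannot go negative
        v = v + dt * dv
        trajectory.append(((k + 1) * dt, x, s, v))
    return trajectory
```

With the feed switched off the model reduces to a batch culture, in which the quantity X + Y S stays constant. That gives a quick sanity check on any implementation.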



3 Topology and Learning of the Recurrent Neural Network 

A Recurrent Trainable Neural Network model (RTNN) and its dynamic 
Backpropagation-type (BP) learning algorithm, together with the explanatory 
figures and stability proofs, are described in [5]. The RTNN topology, given in 
vector-matricial form, is described by the following equations: 

$$ X(k+1) = A X(k) + B U(k) \tag{8} $$ 

$$ Z(k) = S[X(k)] \tag{9} $$ 

$$ Y(k) = S[C Z(k)] \tag{10} $$ 

$$ A = \text{block-diag}(a_{ii}); \quad |a_{ii}| < 1 \tag{11} $$ 

Where: Y, X, and U are, respectively, output, state and input vectors with 
dimensions l, n, m; $A = \text{block-diag}(a_{ii})$ is an (n×n) state 
block-diagonal weight matrix; $a_{ii}$ is the i-th diagonal block of A with 
(1×1) dimension. Equation (11) represents the local stability conditions, 
imposed on all blocks of A; B and C are (n×m) and (l×n) input and output weight 
matrices; S is a vector-valued sigmoid or hyperbolic tangent activation 
function, [5]; the sub-index k is a discrete-time variable. The stability of the 
RTNN model is assured by the activation functions and by the local stability 
condition (11). The most commonly used BP updating rule, [5], is given by: 

$$ W_{ij}(k+1) = W_{ij}(k) + \eta\,\Delta W_{ij}(k) + \alpha\,\Delta W_{ij}(k-1) \tag{12} $$ 

Where: $W_{ij}$ is a general weight, denoting each weight matrix element 
($C_{ij}$, $A_{ij}$, $B_{ij}$) in the RTNN model to be updated; $\Delta W_{ij}$ 
($\Delta C_{ij}$, $\Delta A_{ij}$, $\Delta B_{ij}$) is the weight correction of 
$W_{ij}$; $\eta$ and $\alpha$ are learning rate parameters. The weight updates 
are computed by the following equations: 



$$ \Delta C_{ij}(k) = [T_j(k) - Y_j(k)]\,S'_j(Y_j(k))\,Z_i(k) \tag{13} $$ 

$$ \Delta A_{ij}(k) = R\,X_i(k-1) \tag{14} $$ 

$$ R = C_i(k)\,[T(k) - Y(k)]\,S'_j(Z_j(k)) \tag{15} $$ 

$$ \Delta B_{ij}(k) = R\,U_i(k) \tag{16} $$ 

Where: $\Delta A_{ij}$, $\Delta B_{ij}$, $\Delta C_{ij}$ are weight corrections 
of the weights $A_{ij}$, $B_{ij}$, $C_{ij}$, respectively; (T-Y) is an error 
vector of the output RTNN layer, where T is a desired target vector and Y is an 
RTNN output vector, both with dimension l; $X_i$ is the i-th element of the 
state vector; R is an auxiliary variable; $S'_j$ is the derivative of the 
activation function. A stability proof of this learning algorithm is given in 
[5]. The RTNN described above is applied for identification and adaptive control 
of a fed-batch fermentation process. 
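
The forward pass defined by the RTNN topology equations can be sketched as follows for the case of (1×1) diagonal blocks, so that the state matrix is stored as a list of its diagonal entries. The hyperbolic tangent is used as the activation function, which is one of the two options named in the text. The function name and the argument layout are hypothetical, and the learning rule is not included.

```python
import math

def rtnn_step(x, u, a_diag, b, c):
    """One forward step of the RTNN: returns (next state, output).

    x: state vector (length n), u: input vector (length m),
    a_diag: diagonal of A (length n), b: n x m matrix, c: l x n matrix.
    """
    if any(abs(a) >= 1.0 for a in a_diag):
        raise ValueError("local stability condition |a_ii| < 1 is violated")
    z = [math.tanh(xi) for xi in x]                                  # Z(k) = S[X(k)]
    y = [math.tanh(sum(c_row[i] * z[i] for i in range(len(z))))     # Y(k) = S[C Z(k)]
         for c_row in c]
    x_next = [a_diag[i] * x[i] + sum(b[i][j] * u[j] for j in range(len(u)))
              for i in range(len(x))]                                # X(k+1) = A X(k) + B U(k)
    return x_next, y
```

Checking the stability condition on every call is a deliberate simplification. A training loop would more likely clip the diagonal weights after each update.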

Fig. 2. a) Block-diagram of the direct adaptive neural control with I-term. b) 
Block-diagram of the direct adaptive neural control without I-term. 



4 A Direct Adaptive Neural Control System with Integral Term 

Block-diagrams of the proposed direct adaptive control system with and without 
I-term are given in Fig. 2 a), b). The control scheme given in Fig. 2 a) 
contains two RTNNs and a discrete-time integrator. The first RTNN is a neural 
identifier, which delivers a state estimation vector as an entry to the second 
RTNN, which is a neural controller. An additional entry to the neural controller 
is the control error $e_c(k) = y_m(k) - y_p(k)$. The integral of the control 
error $e_c(k)$ is an additional part of the control. The neural identifier RN-1 
is trained with the identification error $e_i(k) = y_p(k) - \hat{y}_p(k)$ and 
the neural controller is trained with the control error $e_c(k)$, using the 
backpropagation algorithm given by equation (12). The linear approximation of 
the neural identifier RN-1 is given by the equations: 

$$ x_e(k+1) = A x_e(k) + B u(k) \tag{17} $$ 

$$ \hat{y}_p(k) = C x_e(k) \tag{18} $$ 

The equations of the neural controller RN-2 can also be expressed in linear 
form: 

$$ u^{*}(k) = C^{*} x^{*}(k) $$ 

$$ x^{*}(k+1) = A^{*} x^{*}(k) - B_1^{*} x_e(k) + B_2^{*} e_c(k) \tag{19} $$ 

$$ e_c(k) = y_m(k) - y_p(k) $$ 

Where: $x^{*}(k)$ is the $n_c$-dimensional state vector of the neural 
controller; $u^{*}(k)$ is an input vector with dimension m; $e_c(k)$ is a 
control error with dimension l. The reference signal $y_m(k)$ is a train of 
pulses and the control objective is that the control error tends to zero when k 
tends to infinity. The control action is the sum of two components: 
$$u(k) = u^i(k) + u^*(k) \tag{20}$$

$$u^i(k+1) = u^i(k) + T_0 K_i e_c(k) \tag{21}$$

Where: $u^i(k)$ is the output of the integrator with dimension l (here l = m is supposed);
Of is an offset variable with dimension m, which represents the plant's imperfections;
$T_0$ is the period of discretization; $K_i$ is the integrator (l x l) gain matrix. Applying the
z-transformation, we obtain the following expressions and z-transfer functions:


$$u^i(z) = (z-1)^{-1} T_0 K_i e_c(z) \tag{22}$$

$$q_1(z) = C^* (zI - A^*)^{-1} B_1^* \tag{23}$$

$$q_2(z) = C^* (zI - A^*)^{-1} B_2^* \tag{24}$$

$$p(z) = (zI - A)^{-1} B; \quad x_e(z) = p(z) u(z) \tag{25}$$

$$W^p(z) = C_p (zI - A_p)^{-1} B_p; \quad y_p(z) = W^p(z) [u(z) + Of(z)] \tag{26}$$



Where: equation (22) defines the I-term dynamics; equation (25) describes the
dynamics of the hidden layer of the neural identifier RN-1, derived from equation
(17); equation (26) represents the plant dynamics, derived in linear form by
means of a state-space model. The linearized equation (19) of RN-2 can also be
written in z-operator form, using the transfer functions (23) and (24), which yields:

$$u^*(z) = -q_1(z) x_e(z) + q_2(z) e_c(z) \tag{27}$$

Substituting $x_e(z)$ from (25) in (27), and then the obtained result, together with (22),
in the z-transformed equation (20), after some mathematical manipulations we
obtain the following expression for the control signal:

$$u(z) = [I + q_1(z) p(z)]^{-1} [(z-1)^{-1} T_0 K_i + q_2(z)] e_c(z) \tag{28}$$

The substitution of the error $e_c(z)$ from (19) and the control u(z) from (28) in the
plant's equation (26), after some mathematical manipulation, finally gives:

$$\{(z-1) I + W^p(z) [I + q_1(z) p(z)]^{-1} [T_0 K_i + (z-1) q_2(z)]\} y_p(z) = W^p(z) [I + q_1(z) p(z)]^{-1} [T_0 K_i + (z-1) q_2(z)] y_m(z) + (z-1) W^p(z) Of(z) \tag{29}$$

From equation (29), which describes the dynamics of the closed-loop system with
I-term neural control, we can conclude that if the plant (26) is stable and minimum
phase and the neural networks RN-1 and RN-2 are convergent, which signifies that the
transfer functions (23) to (25) are also stable and minimum phase, then the
closed-loop system described by (29) will be stable. The term (z-1) is equivalent to a
discrete-time derivative, so a constant offset Of(k) is compensated by the
I-term and the control error tends to zero ($e_c(k) \to 0$).
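
As a rough numerical illustration of this offset compensation, the following Python sketch closes a loop around a hypothetical stable first-order scalar plant (not the fermentation model) with a constant additive input offset. It compares a purely proportional law with the same law plus the discrete integrator of equation (21). The plant coefficients, the gains and the function name are assumptions chosen only for illustration; the proportional term merely stands in for the neural controller output u*(k).

```python
def simulate(kp, ki, t0=0.1, offset=0.1, reference=1.0, steps=2000):
    """Closed loop around a hypothetical plant y(k+1) = a*y(k) + b*(u(k) + offset).

    kp: proportional gain (stand-in for the neural controller output u*),
    ki: integrator gain K_i, t0: discretization period T_0.
    Returns the final control error e_c = y_m - y_p.
    """
    a, b = 0.9, 0.1          # assumed stable plant with unit static gain
    y = 0.0                  # plant output y_p
    u_i = 0.0                # integrator state u^i
    for _ in range(steps):
        error = reference - y            # e_c(k) = y_m(k) - y_p(k)
        u = u_i + kp * error             # eq. (20): u = u^i + u*
        u_i = u_i + t0 * ki * error      # eq. (21): u^i(k+1) = u^i(k) + T0*Ki*e_c(k)
        y = a * y + b * (u + offset)     # plant driven by control plus offset Of
    return reference - y


static_error_p = simulate(kp=2.0, ki=0.0)    # proportional only
static_error_pi = simulate(kp=2.0, ki=1.0)   # proportional plus I-term
```

Under these assumed values the proportional loop settles with a non-zero static error, while the loop with the I-term drives the error to zero, which is the qualitative behaviour stated above.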



5 Simulation Results 

The fed-batch fermentation process model, given in part two, together with the initial
condition values of the variables, is used for simulation. These parameters are:
V(0) = 3 l, the minimal volume required to operate the fermentor; X(0) = 3.58 g/l, the
bacillus concentration contained in the inoculate used for starting the Bacillus
thuringiensis fermentation; S(0) = 15.6 g/l, the initial concentration of glucose, which
was taken as the limiting substrate; S^ = 34.97 g/l; $\mu_{max}$ = 1.216 h^{-1}, the maximal
specific growth rate reached for the applied operating conditions (X(0) and S(0),
principally); Km = 5 and Y = 7.5 are average values calculated from the experimental
data available and obtained for the applied operating conditions previously indicated.
The topology and learning parameters of the neural identifier RN-1 are: (1, 7, 1),
$\eta$ = 0.1, and $\alpha$ = 0.01. The neural identification is done in closed loop, where the
input of the plant u(t) is the same as the input of RN-1, and it is computed
applying the $\lambda$-tracking method [13], which is as follows:

e{t) = yp{t)-y^{t) 

y 0 ■si I ^(0| ^ 

Where: = 0.65, Z= 0.0025, S=33 and r=l. The period of discretization is chosen 

as $T_0$ = 0.01, which is equivalent to 1 hour of the real process time.
After the identification is completed, the neural controller RN-2 replaces the
$\lambda$-tracking controller and both RTNNs continue their learning. Two
continuous operation cycles of 21 hours each are simulated. The neural controller RN-2 has the
topology (8, 7, 1) and the learning parameters are $\eta$ = 0.75 and $\alpha$ = 0.01. The plant
output $y_p(k)$ is normalized so that it can be compared with the output of the neural identifier
RN-1, $\hat{y}_p(k)$, and form the identification error $e_i(k)$, used to learn its
weights. The graphical simulation results obtained applying an I-term neural control
are given in Fig. 3 A), from a) to h), and those using only a proportional control are
given in Fig. 3 B), from a) to d). For the sake of comparison, both control results are
compared with the results of $\lambda$-tracking control. In the three cases a 10% constant
offset is added to the control signal. For all the control cases used here: a)
represents the output of the plant (biomass concentration), $y_p(k)$, compared with the
reference signal, $y_m(k)$, in two cycles of 21 hours each; b) compares the output of the
Fig. 3. A) Simulation results with direct adaptive control containing I-term: a) comparison
between the plant output $y_p(k)$ and the reference signal $y_m(k)$ [g/l]; b) comparison between the
plant output $y_p(k)$ and the output of the neural identifier RN-1; c) MSE% of approximation; d)
MSE% of control; e) evolution of the substrate consumption in the bioreactor S(k) [g/l]; f)
evolution of the operation volume in the bioreactor V(k) [l]; g) control signal, generated by
RN-2, u(k) [l/h]; h) states of the RN-1, used for control. B) Simulation results using only
direct adaptive proportional control. C) Simulation results using $\lambda$-tracking control. For the
three control schemes, panels a) to d) have the same meaning and are not described again.



plant, $y_p(k)$, with the output of the neural identifier RN-1, $\hat{y}_p(k)$; c) represents the
Mean Squared Error (MSE%) of approximation, which is about 1% at the end of
the learning; d) represents the MSE% of control. Only for case A): e) shows the
evolution of the substrate S(k) over time; f) represents the volume of operation V(k) in the
bioreactor, which increases from the fourth hour of operation, due to the control
input influent introduced in the bioreactor; g) is the feed flow into the bioreactor,
u(k); h) represents the state variables of the neural identifier RN-1, used for control.
The comparison of the graphical results obtained with the control containing the I-term
(Fig. 3A), the control without I-term (Fig. 3B), and the $\lambda$-tracking control
(Fig. 3C) shows that the 10% offset caused a displacement of the plant output with
respect to the reference and a substantial increase of the MSE% of control (from 4%
to 8% for the proportional neural control and to 12% for the $\lambda$-tracking control),
which means that proportional control systems cannot compensate the offset at all
and have static errors.



6 Conclusions 

A nonlinear mathematical model of a fed-batch fermentation process of Bacillus
thuringiensis (Bt.) is derived. The obtained model is validated by experimental data.
Schemes of a direct adaptive neural control system with and without integral term are
proposed. The system contains a neural identifier and a neural controller, based on the
recurrent trainable neural network model. The applicability of both the proportional
and the integral-term direct adaptive neural control schemes is confirmed by
comparative simulation results, also with respect to the $\lambda$-tracking control.
The schemes exhibit good convergence, but only the I-term control compensates a
constant offset; the proportional controls do not.
Abstract. The paper proposes a new fuzzy-neural recurrent multi-model for
systems identification and states estimation of complex nonlinear mechanical
plants with friction. The parameters and states of the local recurrent neural
network models are used for the design of local direct and indirect adaptive
trajectory tracking control systems. The designed local control laws are
coordinated by a fuzzy rule-based control system. The applicability of the
proposed intelligent control system is confirmed by simulation and comparative
experimental results, which show good convergence.



1 Introduction 

In the last decade, Neural Networks (NN) became a universal tool for many
applications. NN modeling and its application to system identification, prediction
and control has been discussed by many authors [1], [2], [3]. Mainly, two types of NN
models are used: Feedforward (FFNN) and Recurrent (RNN). The drawbacks of the
NN models described in the literature can be summarized as follows: 1) there exists
a great variety of NN models and their universality is missing [1], [2], [3]; 2) all NN
models are sequential in nature as implemented for systems identification (the FFNN
model uses one or two tap-delays in the input [1], and RNN models usually are based
on the autoregressive model [2], which is a one-layer sequential one); 3) some of the
applied RNN models are not trainable in the feedback part; 4) most of them are
dedicated to SISO and not to MIMO applications [2]; 5) in most cases, the
stability of the RNN is not considered [2], especially during the learning; 6) in the
case of FFNN application for systems identification, the plant is given in one of the
four plant models described in [1], the linear part of the plant model, especially the
system order, has to be known, and the FFNN approximates only the non-linear part of
this model [1]; 7) all these NN models are nonparametric ones [3], and so are not
applicable to adaptive control systems design; 8) none of these NN models
performs state and parameter estimation at the same time [3]; 9) all these models are
appropriate only for identification of nonlinear plants with smooth, single, odd,
nonsingular nonlinearities [1]. Baruch et al. [4], in a previous paper, applied the
state-space approach to describe RNN in a universal way, defining a Jordan
canonical two- or three-layer RNN model, named Recurrent Trainable Neural
Network (RTNN), and a dynamic Backpropagation (BP) algorithm for its learning. This
NN model is a parametric one, permitting the use of the parameters and
states obtained during the learning for control systems design. Furthermore, the RTNN model is
a system state predictor/estimator, which permits the obtained system states to be used
directly for state-space control. For complex nonlinear plants, Baruch et al. [5]
proposed to use a fuzzy-neural multi-model, which is also applied for identification of
systems with friction [6], [7]. In [8] a wide scope of references using the fuzzy-neural
approach for nonlinear plants approximation is given and the RNN architecture of
Frasconi-Gori-Soda is used. The main disadvantage of this work is that the RNN model
applied there is sequential in nature. Depending on the model order, this RNN
model generates different computational time delays, which makes the fuzzy
system synchronization difficult. So, the aim of this paper is to go further, using the RTNN as
an identification and state estimation tool in direct and indirect adaptive fuzzy-neural
multi-model based control systems of nonlinear plants, illustrated by representative
simulation and comparative experimental results.



2 Models Description 

2.1 Recurrent Neural Model and Learning 

The RTNN model is described by the following equations, [4] : 



$$X(k+1) = J X(k) + B U(k) \tag{1}$$

$$Z(k) = S[X(k)] \tag{2}$$

$$Y(k) = S[C Z(k)] \tag{3}$$

$$J = \text{block-diag}(J_i); \quad |J_i| < 1 \tag{4}$$



Where: X(k) is an N-dimensional state vector; U(k) is an M-dimensional input vector; Y(k) is an
L-dimensional output vector; Z(k) is an auxiliary vector variable with dimension L; S(x) is a vector-valued
activation function with appropriate dimension; J is a weight-state diagonal matrix
with elements $J_i$; equation (4) is a stability condition; B and C are weight input and
output matrices with appropriate dimensions and block structure, corresponding to the
block structure of J. As can be seen, the given RTNN model is a completely parallel
parametric one, with parameters the weight matrices J, B, C, and the state vector
X(k), so it is useful for identification and control purposes. The controllability and
observability of this model are considered in [4]. The general BP learning algorithm is
given as:

$$W_{ij}(k+1) = W_{ij}(k) + \eta \Delta W_{ij}(k) + \alpha \Delta W_{ij}(k-1) \tag{5}$$

Where: $W_{ij}$ (C, J, B) is the ij-th weight element of each weight matrix (given in
parentheses) of the RTNN model to be updated; $\Delta W_{ij}$ is the weight correction of $W_{ij}$;
$\eta$, $\alpha$ are learning rate parameters. The updates $\Delta C_{ij}$, $\Delta J_{ij}$, $\Delta B_{ij}$ of $C_{ij}$, $J_{ij}$, $B_{ij}$ are
given by:
$$\Delta C_{ij}(k) = [T_j(k) - Y_j(k)] S_j'(Y_j(k)) Z_i(k) \tag{6}$$

$$\Delta J_{ij}(k) = R_1 X_i(k-1) \tag{7}$$

$$\Delta B_{ij}(k) = R_1 U_i(k) \tag{8}$$

$$R_1 = C_i(k) [T(k) - Y(k)] S_j'(Z_j(k)) \tag{9}$$


Where: T is a target vector with dimension L; [T-Y] is an output error vector with
the same dimension; $R_1$ is an auxiliary variable; S'(x) is the derivative of the
activation function, which for the hyperbolic tangent is $S_j'(x) = 1 - x^2$. The stability of
the learning algorithm and its applicability for systems identification and control are
proven in [4], where the results of a DC motor neural control are also given.
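
To make the model and its learning rule more concrete, here is a minimal single-input, single-output Python sketch of equations (1) to (9) with a hyperbolic tangent activation. It is an illustration under stated assumptions, not the authors' implementation: the class name, the initial weight values, and the clipping of the diagonal elements of J (used here to keep the stability condition (4) satisfied) are all hypothetical choices.

```python
import math


class RTNN:
    """SISO sketch of eqs. (1)-(9): N hidden states, diagonal feedback matrix J."""

    def __init__(self, n_hidden, eta=0.1, alpha=0.01):
        self.J = [0.5] * n_hidden       # diagonal of J, |J_i| < 1 as required by eq. (4)
        self.B = [0.5] * n_hidden       # input weights (assumed initial values)
        self.C = [0.5] * n_hidden       # output weights (assumed initial values)
        self.X = [0.0] * n_hidden       # state X(k)
        self.X_prev = [0.0] * n_hidden  # state X(k-1), needed by eq. (7)
        self.eta, self.alpha = eta, alpha
        # Previous weight corrections, used by the momentum term of eq. (5).
        self.dJ = [0.0] * n_hidden
        self.dB = [0.0] * n_hidden
        self.dC = [0.0] * n_hidden

    def step(self, u, target=None):
        """Compute Y(k); if a target is given, apply one BP update; advance the state."""
        Z = [math.tanh(x) for x in self.X]                      # eq. (2)
        y = math.tanh(sum(c * z for c, z in zip(self.C, Z)))    # eq. (3)
        if target is not None:
            e = target - y                                      # output error T - Y
            for i in range(len(self.X)):
                dC = e * (1.0 - y * y) * Z[i]                   # eq. (6)
                R = self.C[i] * e * (1.0 - Z[i] * Z[i])         # eq. (9)
                dJ = R * self.X_prev[i]                         # eq. (7)
                dB = R * u                                      # eq. (8)
                # eq. (5): learning-rate term plus momentum term
                self.C[i] += self.eta * dC + self.alpha * self.dC[i]
                self.B[i] += self.eta * dB + self.alpha * self.dB[i]
                J_new = self.J[i] + self.eta * dJ + self.alpha * self.dJ[i]
                # Assumption: clip J_i so that condition (4) always holds.
                self.J[i] = max(-0.999, min(0.999, J_new))
                self.dC[i], self.dB[i], self.dJ[i] = dC, dB, dJ
        self.X_prev = self.X
        self.X = [j * x + b * u for j, x, b in zip(self.J, self.X, self.B)]  # eq. (1)
        return y
```

A typical use is to call `step(u, target)` once per sample during identification, and `step(u)` without a target when the network is only used as a state estimator.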



2.2 Fuzzy-Neural Multi-model 

For complex dynamic systems identification, the Takagi-Sugeno fuzzy rule, cited in
[8], admits a crisp function in the consequent part, which could be a static or
dynamic (state-space) model. Some authors, referred to in [8], proposed to use a NN
function as the consequent crisp function. Baruch et al. [5], [6], [7] proposed to use
a RTNN function model as the consequent crisp function, so as to form a fuzzy-neural
multi-model. The fuzzy rule of this model is given by the following statement:

$$R_i: \ \text{IF } x \text{ is } A_i \text{ THEN } y_i(k+1) = N_i[x(k), u(k)], \quad i = 1, 2, \ldots, P \tag{10}$$

Where: $N_i(\cdot)$ denotes the RTNN model, given by equations (1) to (3); i is the model
number; P is the total number of models, corresponding to the rules $R_i$. The output of the fuzzy-neural
multi-model system is given by the following equation:

$$Y = \sum_i w_i y_i = \sum_i w_i N_i(x, u) \tag{11}$$

Where $w_i$ are weights, obtained from the membership functions [9]. As can be
seen from equation (11), the output of the approximating fuzzy-neural
multi-model is obtained as a weighted sum of RTNN functions [9], given in the consequent
part of (10). In the case when the intervals of the variables, given in the antecedent
parts of the rules, are not overlapping, the weights take the value one and the weighted
sum (11) is converted into a simple sum. This simple case, called fuzzy-neural
multi-model [7], [8], [11], will be considered here.
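
The weighted sum of equation (11) and its non-overlapping special case can be sketched in a few lines of Python. The two local models below are hypothetical linear stand-ins for trained RTNNs, and the crisp membership functions split the domain into a positive and a negative part; all names and coefficients are assumptions made for illustration.

```python
def multimodel_output(x, u, rules):
    """Eq. (11): Y = sum_i w_i * N_i(x, u).

    rules: list of (membership, local_model) pairs, where membership(x) returns
    the weight w_i in [0, 1] and local_model(x, u) stands in for the RTNN N_i.
    """
    return sum(membership(x) * model(x, u) for membership, model in rules)


# Non-overlapping (crisp) antecedents: exactly one weight equals one.
def positive(x):
    return 1.0 if x >= 0.0 else 0.0


def negative(x):
    return 1.0 if x < 0.0 else 0.0


# Hypothetical local models standing in for two trained RTNNs.
def model_pos(x, u):
    return 0.8 * x + 0.2 * u


def model_neg(x, u):
    return 0.6 * x + 0.4 * u


rules = [(positive, model_pos), (negative, model_neg)]
y_pos = multimodel_output(1.0, 1.0, rules)    # only the positive-part model is active
y_neg = multimodel_output(-1.0, 1.0, rules)   # only the negative-part model is active
```

With overlapping membership functions the same function returns a blend of the local outputs, so the crisp case is simply the limit in which one weight is one and the others are zero.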
3 An Adaptive Fuzzy-Neural Control System Design

3.1 A Direct Adaptive Fuzzy-Neural Control 

The structure of the entire identification and control system contains a fuzzifier, a
Fuzzy Rule-Based System (FRBS), and a set of RTNN models. The system does not
need a defuzzifier, because the RTNN models are crisp limited nonlinear state-space
models. The direct adaptive neural multimodel control system incorporates in its
FRBS a set of RTNN controllers. The control fuzzy rules are:

$$R_i: \ \text{IF } x \text{ is } A_i \text{ THEN } u_i = U_i(k), \quad i = 1, 2, \ldots, L \tag{12}$$

$$U_i(k) = -N_{fb,i}[x_i(k)] + N_{ff,i}[r_i(k)] \tag{13}$$

Where: r(k) is the reference signal; x(k) is the system state; $N_{fb,i}[x_i(k)]$ and $N_{ff,i}[r_i(k)]$
are the feedback and feedforward parts of the fuzzy-neural control. The total control
issued by the fuzzy-neural multi-model system is described by the following equation:

$$U(k) = \sum_i w_i U_i(k) \tag{14}$$

Where $w_i$ are weights, obtained from the membership functions [9], corresponding to
the rules (12). As can be seen from equation (14), the control is
obtained as a weighted sum of controls [11], given in the consequent part of (12). In
the case when the intervals of the variables, given in the antecedent parts of the rules,
are not overlapping, the weights take the value one and the weighted sum (14) is
converted into a simple sum; this particular multi-model case will be considered
here. The block diagram of a direct adaptive multi-model control system is given in Fig. 1.
The block diagram contains a fuzzy-neural identifier and a fuzzy-neural controller.



3.2 An Indirect Adaptive Fuzzy-Neural Control 

The block diagram of the indirect adaptive fuzzy-neural control system is given in
Fig. 2. The structure of the system contains a fuzzy-neural identifier, which contains
two neural models, RTNN-1 and RTNN-2, issuing state ($X_i(k)$) and parameter ($J_i$, $B_i$, $C_i$)
information to local linear controllers. The multimodel control is given by the same
rule (12), but here the local control [11] is given by:

$$U_i(k) = (C_i B_i)^{-1} \{ -C_i J_i X_i(k) + r_i(k+1) + \gamma [r_i(k) - Y_i(k)] \}; \quad i = 1, 2, \ldots, P \tag{15}$$

In this particular case, we use only two neural nets for process identification.
RTNN-1 corresponds to the positive part of the plant output signal, and RTNN-2
corresponds to the negative one. For these two neural models, two corresponding
controls $U_1(k)$ and $U_2(k)$ are computed by equation (15), where the control parameter $\gamma$
is a real number between -0.999 and 0.999 and $r_i(k)$ is the corresponding local reference
signal. If the RTNN is observable and controllable, then the local matrix product $C_i B_i$
is different from zero ($C_i B_i \neq 0$).
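
The effect of the local control law can be checked on a scalar example. The sketch below reads equation (15) as U = (CB)^{-1} { -C J X(k) + r(k+1) + gamma [r(k) - Y(k)] } and applies it to a hypothetical linearized local model X(k+1) = J X(k) + B U(k), Y(k) = C X(k). Under that reading the tracking error obeys e(k+1) = -gamma e(k), which is why |gamma| < 1 is required. The numerical values and function names are assumptions for illustration only.

```python
def local_control(C, J, B, x, r_next, r_now, gamma):
    """Scalar form of the local control law, as read from eq. (15)."""
    y = C * x
    return (-C * J * x + r_next + gamma * (r_now - y)) / (C * B)


def tracking_errors(C=2.0, J=0.5, B=1.0, gamma=0.1, reference=1.0, steps=6):
    """Apply the law to an assumed linear local model and record e(k) = r(k) - Y(k)."""
    x = 0.0
    errors = []
    for _ in range(steps):
        errors.append(reference - C * x)
        u = local_control(C, J, B, x, reference, reference, gamma)
        x = J * x + B * u            # linearized local model, assumed exact here
    return errors


errors = tracking_errors()
```

For a constant reference the recorded errors alternate in sign and shrink by the factor gamma at every step, so a small |gamma| gives a nearly dead-beat response.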
Fig. 1. Block-diagram of the direct adaptive fuzzy-neural multimodel control system.

Fig. 2. Block-diagram of the indirect adaptive fuzzy-neural multimodel control system.

Next, both adaptive control schemes using the multi-RTNN model will be applied
for real-time identification and control of a nonlinear mechanical system. It is
expected that a learning adaptive model like the fuzzy-neural RTNN
multi-model will be well suited for identification and control of such a nonlinear
process with unknown variable parameters and dynamic effects.



4 Simulation Results 

Let us consider a DC-motor-driven nonlinear mechanical system with friction (see
[9] for more details about the friction model), with the following friction
parameters: $\alpha$ = 0.001 m/s; $F_s^+$ = 4.2 N; $F_s^-$ = -4.0 N; $\Delta F^+$ = 1.8 N; $\Delta F^-$ = -1.7 N; $v_{cr}$
= 0.1 m/s; $\beta$ = 0.5 Ns/m. Let us also consider that position and velocity measurements
are taken with period of discretization $T_0$ = 0.1 s, the system gain $k_0$ = 8, the mass m
= 1 kg, and the load disturbance depends on the position and the velocity (d(t) =
$d_1$ q(t) + $d_2$ v(t); $d_1$ = 0.25; $d_2$ = -0.7). So the discrete-time model of the 1-DOF mass
mechanical system with friction is obtained in the form [6], [7]:
$$x_1(k+1) = x_2(k) \tag{16}$$

$$x_2(k+1) = -0.025 x_1(k) - 0.3 x_2(k) + 0.8 u(k) - 0.1 fr(k) \tag{17}$$

$$v(k) = x_2(k) - x_1(k) \tag{18}$$

$$y(k) = 0.1 x_1(k) \tag{19}$$



Where: $x_1(k)$, $x_2(k)$ are system states; v(k) is system velocity; y(k) is system position;
fr(k) is a friction force, taken from [9], with the friction parameter values given above.
The graphs of the simulation results obtained with the direct adaptive control
system of Fig. 1 are shown in Fig. 3 a) to d). Fig. 3 a)
compares the output of the plant with the reference, which is r(k) = 0.8 sin(2$\pi$k). To
form the local neural controls, this reference signal is divided in two parts, positive
and negative. The time of learning is 60 sec. The two identification RTNNs have
architectures (1, 5, 1) and the two feedback control RTNNs have architectures (5, 5,
1). The two feedforward RTNNs have architectures (1, 5, 1). The learning parameters
are $\eta$ = 0.1, $\alpha$ = 0.2. Fig. 3 b) shows the results of identification, where the output of
the plant is divided in two parts, which are identified by two RTNNs. The combined
control signal and the MSE% of control are given in Fig. 3 c) and d). As can be seen
from the last graph, the MSE% rapidly decreases and reaches values below 1%.

The graphs of the simulation results obtained with the indirect adaptive control
system of Fig. 2 are shown in Fig. 4 a) to d). Fig. 4 a) compares the output of
the plant with the reference signal (r(k) = 10 sin(2$\pi$k)), which is divided in two parts,
positive and negative. The time of learning is 100 sec. The two identification and the
two FF control RTNNs have topologies (1, 5, 1). The learning rate parameters are $\eta$ =
0.001, $\alpha$ = 0.01, and the control parameter is $\gamma$ = 0.1. Fig. 4 b) shows the results of
identification, where the output of the plant is divided in two parts, identified by two
RTNNs. The state and parameter information issued by the identification multimodel
is used to design a linear control law. The combined control signal and the MSE% of
control are given in Fig. 4 c) and d). As can be seen from the last graph, the MSE%
rapidly decreases and reaches values below 1.5%.
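
For readers who want to experiment with the plant used in these simulations, the discrete-time model (16) to (19) can be stepped as in the Python sketch below. The friction force fr(k) comes from the model in [9], which is not reproduced here, so the sketch takes it from a caller-supplied function that defaults to zero friction; that placeholder and the function names are assumptions, not part of the original work.

```python
def plant_step(x1, x2, u, friction=0.0):
    """One step of eqs. (16)-(19); friction is fr(k), supplied by the caller."""
    x1_next = x2                                                  # eq. (16)
    x2_next = -0.025 * x1 - 0.3 * x2 + 0.8 * u - 0.1 * friction   # eq. (17)
    v = x2 - x1                                                   # eq. (18)
    y = 0.1 * x1                                                  # eq. (19)
    return x1_next, x2_next, v, y


def simulate_plant(inputs, friction_model=lambda v: 0.0):
    """Run the plant from rest; friction_model(v) is a hypothetical stand-in for [9]."""
    x1 = x2 = 0.0
    outputs = []
    for u in inputs:
        v = x2 - x1
        x1, x2, _, y = plant_step(x1, x2, u, friction_model(v))
        outputs.append(y)
    return outputs


step_response = simulate_plant([1.0] * 200)   # unit step input, no friction
```

Without friction the model is a stable linear second-order system, so the unit-step response settles at 0.1 * 0.8 / 1.325; a nonlinear friction model passed as `friction_model` would change both the transient and the steady state.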



5 Experimental Results 

In this section, the effectiveness of the multi-model scheme is illustrated by a real-time
DC-motor position control [4], using two RTNNs (positive and negative) for the
system's output, control and reference signals. A 24 Volts, 8 Amperes DC motor,
driven by a power amplifier and connected by a data acquisition control board
(Multi-Q™) with the PC, has been used. The RTNN was programmed in MatLab™-Simulink™
and WinCon™, which is a real-time Windows 95 application that runs
Simulink generated code using the real-time Workshop to achieve digital real-time
control on a PC. The load is charged and discharged on the DC motor shaft by means
of an electrically switched clutch. The control signal is designed in two parts, feedback
and feedforward. For the sake of simplicity, the feedback part is realized as a P-controller
(Kp=5) and only the feedforward part of the fuzzy-neural control scheme is applied,
which allows us to omit the system identification.
Fig. 3. Graphical results of simulation using direct adaptive fuzzy-neural multimodel control. a)
Comparison of the output of the plant and the reference signal; b) Comparison of the output of
the plant and the outputs of the identification RTNNs; c) Combined control signal; d) Mean
Squared Error of control (MSE%).

So, the neural multimodel feedforward controller is realized by means of two
RTNNs, positive and negative. The graphs of the experimental results obtained
with this control are given in Fig. 5 a) to e). Fig. 5 a) and b) compare the DC motor shaft
position with the reference signal (r(k) = 1.5 sin(0.2 k)) in the absence of load
(Fig. 5a, 0-45 sec) and in the presence of load (Fig. 5b, 70-150 sec). The control
signal and the MSE% of reference tracking are given in Fig. 5c and d, respectively, for the
complete time of the experiment (0-200 sec). The architectures of the two feedforward
RTNNs are (1, 5, 1). The learning rate parameters are $\eta$ = 0.003, $\alpha$ = 0.0001
and the period of discretization is $T_0$ = 0.001 sec. The MSE% exhibits fast
convergence and rapidly reaches values below 1%. For the sake of comparison, the same
experiment was repeated for DC-motor position control with a PD controller (Kp=3,
Kd=0.05). The results obtained without load are almost equal to those of Fig. 5a, but
when the load is charged on the DC-motor shaft, a lack of tracking precision is observed
(see Fig. 5e, 45-100 sec).

The obtained values of the MSE% for both control experiments (with a neural
multimodel control - 0.761% and with a PD control - 0.286%) show that the
precision of tracking of the neural multimodel control is about two times greater than
Fig. 4. Graphical results of simulation using indirect adaptive fuzzy-neural multimodel control.
a) Comparison of the output of the plant and the reference signal; b) Comparison of the output
of the plant and the outputs of the identification RTNNs; c) Combined control signal; d) Mean
Squared Error of control (MSE%).

that obtained by means of the PD control. Furthermore, the neural multimodel control
can adapt to a load variation, whereas the PD control needs a gain update.



6 Conclusions 

A two-layer Recurrent Neural Network (RNN) and an improved dynamic
Backpropagation method for its learning are described. For complex nonlinear plant
identification and control, a fuzzy-neural multi-model is used. The fuzzy-neural
multi-model, containing two RNNs, is applied for real-time identification and direct
adaptive control of a nonlinear mechanical system with friction, where the simulation
results exhibit good convergence. The obtained comparative experimental results of
a DC-motor control are also acceptable, which confirms the applicability of the
proposed fuzzy-neural multi-model control scheme.






Fig. 5. Experimental results of a DC motor position control using direct adaptive neural 
multimodel control scheme. Comparison of the output of the plant and the reference signal: a) 
without load (0-45 sec.); b) with load (70-150 sec.); c) Combined control signal (0-200 sec.); d) 
Mean Squared Error of control; e) Comparison of the output of the plant and the reference 
signal for control with PD controller and load on a DC-motor shaft (45-100 sec.). 
Abstract. Fuzzy controllers could be broadly used in process control
thanks to their good performance. One disadvantage is the problem of tuning
them, which implies handling a great number of variables, mainly:
the ranges of the membership functions, the shape of these functions, the
percentage of overlap among the functions, their number, and the design
of the rule base. The problem is worse for multivariable systems,
because the number of parameters grows exponentially with the number of
variables. The importance of the tuning problem lies in obtaining fuzzy systems
that decrease the settling time of the processes to which they are applied. In this
work a very simple algorithm is presented for the tuning of fuzzy controllers,
using only one variable to adjust the performance of the system. The results are
obtained considering the relationship that exists between the membership
functions and the settling time.



1 Introduction 

The methodology for tuning a controller is sometimes a heuristic task. Some
elements are important to consider in the tuning, like the bandwidth, the
steady-state error, or the speed of response. It is possible to use this information or to make
different tests to find the optimal parameters. In the case of a PID controller it is
necessary to find three parameters (proportional gain, derivative time, and integral
time). In the case of fuzzy controllers, there are many parameters to compute,
like the number of membership functions, the ranges of every function, the rules of
membership functions, the shape of these functions, the percentage of overlap, etc. [1-3].
Many people prefer to use a well-known PID controller rather than a fuzzy controller
with all these parameters to estimate. This is a very important reason why
this intelligent controller is not used in industry.

The tuning of any type of controller implies the adjustment of the parameters to
obtain a desired behavior or a good approximation, with minimal error, to the desired
response. The different methods published in the area for the problem of fuzzy
controller tuning use methodologies like evolutionary computation [4-6] and
artificial neural networks [7-9]. These methods search for the solution according to
objective functions, parameter estimation, gradient error, etc., but in many cases these
alternatives have serious convergence problems: the mathematical
representation is very complex, the computation time is very long, or the solution
computed may be only a local minimum.

In this paper a very simple method for tuning fuzzy controllers is presented, using
only one parameter. The method is based on the relation between the
stabilization time and the range of the membership functions. The paper is
structured in the following way: section 2 presents the relationship that exists
between the location of the membership functions and the transfer characteristic.
Section 3 describes the system used and the controller to be
tuned. Section 4 outlines an algorithm of parametric tuning that
modifies the operation points that define the group of membership functions.
Section 5 shows the results of the simulations for different values of the tuning
factor, together with graphs that show the behavior of the settling time as a function of
the tuning factor. The simulations were carried out using Simulink of Matlab. Finally,
some conclusions end this paper.



2 Transfer Characteristic of a Fuzzy Controller 

The transfer characteristic allows defining the fuzzy controller's behavior in terms of its
response speed, sensitivity and reaction to disturbances, using the location of the operation
points. As will be described later on, this is related to the choice of the fuzzy
controller's gain dy/dx, where y is the output and x is the input of the system, in
different regions of the domain x, since, given a domain x, the location of the
operation points determines the slopes of the transfer characteristic in different parts
of the domain x.

Case 1. For a flat slope in the middle of the domain x and increasing slopes toward
increasing |x| values, choose larger distances between operation points in the middle
of the domain (see figure 1). This means:

For $|x_2| > |x_1| \Rightarrow |dy/dx|_{x_2} > |dy/dx|_{x_1}$

Case 2. For a steep slope in the middle of the domain x and decreasing slopes
toward increasing |x| values, choose smaller distances between operation points in
the middle of the domain (see figure 2). This means:

For $|x_2| > |x_1| \Rightarrow |dy/dx|_{x_2} < |dy/dx|_{x_1}$

Case 1 should be chosen if, for small errors, a slow reaction to disturbances of the
system under control is required, and case 2 should be chosen if, for small errors, the
system is supposed to be sensitive with respect to disturbances.

Note that in the previous figures the values for the intervals of the membership
functions are important for the slopes and the speed of the controller response. If the
membership functions "expand" (figure 1), then the response is slower than with a
compressed group of membership functions (figure 2).
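
The two cases can be illustrated numerically by treating the transfer characteristic as piecewise linear between operation points. The Python sketch below is only an illustration: it assumes equally spaced output operation points and two hypothetical sets of input operation points, then computes the slope dy/dx of each segment.

```python
def segment_slopes(x_points, y_points):
    """Slopes dy/dx of the piecewise-linear characteristic through the operation points."""
    return [
        (y1 - y0) / (x1 - x0)
        for x0, x1, y0, y1 in zip(x_points, x_points[1:], y_points, y_points[1:])
    ]


# Assumed, equally spaced output operation points.
y_points = [-1.0, -0.5, 0.0, 0.5, 1.0]

# Case 1: larger distances in the middle of the input domain -> flat middle slope.
case1 = segment_slopes([-1.0, -0.8, 0.0, 0.8, 1.0], y_points)

# Case 2: smaller distances in the middle -> steep middle slope.
case2 = segment_slopes([-1.0, -0.2, 0.0, 0.2, 1.0], y_points)
```

In this example the middle segments of case 1 are flatter than its outer segments, and the opposite holds for case 2, matching the two inequalities above.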
Fig. 1. Relationship between the location of the membership functions and the transfer
characteristic for the case 1.

Fig. 2. Relationship between the location of the membership functions and the transfer
characteristic for the case 2.



3 Fuzzy Controller 

For the analysis and simulations with the tuning algorithm, a second-order system has
been considered:

$$G(s) = \frac{1}{0.45 s^2 + 2s + 1} \qquad (1)$$

which is overdamped, with a damping ratio $\zeta = 1.4907$ and a natural frequency
$\omega_n = 1.4907$ rad/s.
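The two figures quoted for the plant can be checked directly from the denominator coefficients. The following is a minimal sketch, assuming the plant is 1/(0.45 s^2 + 2 s + 1); the function name `second_order_params` is illustrative and not from the paper.

```python
import math

def second_order_params(a2, a1, a0):
    """Return (damping ratio, natural frequency) of 1 / (a2*s^2 + a1*s + a0).

    Dividing by a2 gives s^2 + 2*zeta*wn*s + wn^2, hence
    wn = sqrt(a0/a2) and zeta = a1 / (2*sqrt(a0*a2)).
    """
    wn = math.sqrt(a0 / a2)                  # natural frequency, rad/s
    zeta = a1 / (2.0 * math.sqrt(a0 * a2))   # damping ratio (dimensionless)
    return zeta, wn

zeta, wn = second_order_params(0.45, 2.0, 1.0)
print(round(zeta, 4), round(wn, 4))
```

Both values coincide here because the constant term is 1 and the linear coefficient is 2, so zeta and wn both reduce to 1/sqrt(0.45).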

The fuzzy controller designed to control the plant described above is a
TISO (two inputs, one output) system, where the inputs are the error and the change of
error, while the output is the control action. Each of the controller's variables has
been divided into 5 fuzzy regions. The fuzzy associative memory, made up of 25
rules, is shown in figure 4.

The membership functions were defined with a triangular shape in the middle and a
trapezoidal shape at the extremes, such that adjacent functions always overlap at the grade of
membership $\mu(x) = 0.5$ (figures 5, 6 and 7). These membership functions will be
considered later on as the initial conditions for the proposed algorithm. The control
surface for the fuzzy controller under its initial conditions is shown in figure 8.
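The shape constraint described above (triangles in the middle, trapezoids at the extremes, adjacent sets crossing at 0.5) can be sketched from a vector of operation points. The point values below are hypothetical, since the paper gives the actual ones only in figures 5 to 7; the function name `memberships` is likewise an illustration.

```python
def memberships(x, points):
    """Membership degrees of x in the fuzzy sets peaked at the sorted `points`.

    Inner sets are triangular; the two outer sets are trapezoidal, i.e. they
    stay at 1 beyond the first and last operation point.
    """
    n = len(points)
    mu = [0.0] * n
    if x <= points[0]:
        mu[0] = 1.0
    elif x >= points[-1]:
        mu[-1] = 1.0
    else:
        for i in range(n - 1):
            left, right = points[i], points[i + 1]
            if left <= x <= right:
                w = (x - left) / (right - left)
                mu[i] = 1.0 - w      # falling edge of set i
                mu[i + 1] = w        # rising edge of set i + 1
                break
    return mu

# Hypothetical operation points for five regions (GN, MN, Z, MP, GP).
points = [-40.0, -20.0, 0.0, 20.0, 40.0]
print(memberships(10.0, points))
```

Because each set falls linearly from its own peak to the neighbouring peak, two adjacent sets cross at grade 0.5 halfway between their operation points, whatever the spacing. Moving the operation points is therefore enough to expand or compress the whole partition.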



4 Tuning Algorithm 

The objective of the tuning algorithm is to manipulate, by means of a single
variable and in a simple way, the settling time of the system, from the settling time of the
response without a controller down to 1/5 of that value. The response must also satisfy
the constraints of small overshoot and no persistent oscillations, that is,
a very smooth response.

Table 1. Fuzzy variables for the controller

Input variables                                   Output variable
error                   change of error           control action
GN: Big negative        GN: Big negative          DG: Big diminution
MN: Medium negative     MN: Medium negative       DP: Small diminution
Z:  Zero                Z:  Zero                  M:  Hold
MP: Medium positive     MP: Medium positive       AP: Small increase
GP: Big positive        GP: Big positive          AG: Big increase





      GN   MN   Z    MP   GP
GN    AG   AG   AP   DP   DG
MN    AG   AP   M    M    DG
Z     AG   AP   M    DP   DG
MP    AG   M    M    DP   DG
GP    AG   AP   DP   DG   DG

Fig. 3. Example of control surface of a fuzzy controller
Fig. 4. Fuzzy associative memory for the control system




Fig. 5. Membership functions for the input variable error
Fig. 6. Membership functions for the input variable change of error
Fig. 7. Membership functions for the output variable control action
Fig. 8. Control surface for the fuzzy controller with the membership functions under their initial conditions









E. Gomez-Ramirez and A. Chavez-Plascencia 



This algorithm is based on the properties of the transfer characteristic or, in this
case, of the control surface, which make it possible to modify the controller's behavior by
changing the position and support of the membership functions while keeping
the fuzzy controller's structure fixed. A slower response is obtained for configurations
with wide or expanded membership functions in the center and narrow ones at the ends;
conversely, a faster response is obtained for configurations with narrow or compressed
membership functions in the center and wide ones at the ends.

The tuning algorithm only modifies the membership functions of the input
variables. The arrangement of the membership functions of the output fuzzy
variable remains constant, since it depends only on a proportion of
the range of the control action; that is, they always remain uniformly spaced.



4.1 Tuning Factor Selection 

The tuning factor is a number $k \in [0, 1]$ that determines the degree of tuning
adjustment, giving the largest settling time for k = 0 and the smallest
settling time for k = 1.



4.2 Normalization of the Ranges of the Fuzzy Controller’s Variables 

In this step the range of each input fuzzy variable is modified so that its upper
and lower limits are equal to +1 and -1, respectively.



4.3 Tuning Factor Processing 

To expand and compress the values on the x-axis of the membership functions, a
function fulfilling this condition is needed, such that the new vector of
operation points is given by:

$$V_{op} = \left(V_{op,norm}\right)^{r(k)} \qquad (2)$$

where $V_{op,norm}$ are the normalized values of the membership functions on the x-axis
and r(k) is a polynomial.




Fig. 9. Plot of function r(k) 

The initial coefficients of the polynomial were obtained using the least-squares
method. The values of k were defined in this way to allow an estimate over
their whole range, $k \in [0, 1]$. The values of r, since it is an exponent, were defined
considering how a number increases or decreases when it is raised to the exponent
r. Recall that the goal is to expand (slow response) for k = 0 and to compress (fast
response) for k = 1. For values below r = 1/40 the response of the system was not
satisfactory, and likewise for values above r = 3. The polynomial obtained
was:

$$r(k) = \frac{30k^3 + 37k^2 + 52k + 1}{40} \qquad (3)$$



This r(k) was found by testing the optimal response for different dynamical systems
and finding the optimal parameters of the polynomial that fit the function for
different values of k (k = 0, 0.5, 1) (figure 11).
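As a quick consistency check, the polynomial (3), read here as r(k) = (30k^3 + 37k^2 + 52k + 1)/40, can be evaluated at the three tuning factors used in the simulations. This is a minimal sketch; only the coefficients come from the paper, the function name `r` mirrors its notation.

```python
def r(k):
    """Exponent polynomial r(k) for a tuning factor k in [0, 1]."""
    return (30 * k**3 + 37 * k**2 + 52 * k + 1) / 40

for k in (0.0, 0.5, 1.0):
    print(k, r(k))
```

The three values agree with those quoted in Section 5: r(0) = 1/40, r(0.5) = 1 and r(1) = 3. All coefficients are positive, so r(k) increases monotonically over [0, 1], which is what makes k usable as a single tuning knob.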



4.4 Denormalization of the Ranges of the Fuzzy Variables

In this step the normalized range is converted back to the original range of the
system. This is done simply by multiplying the Vop vector by a constant factor.
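Steps 4.2 to 4.4 can be put together in a short sketch. Several things here are assumptions inferred from the numbers reported for k = 0 in Section 5.1 rather than stated in the text: the warping is taken to be a signed power law (magnitude raised to r, sign kept), the normalized operation points are taken as -2/3, -1/3, 0, 1/3, 2/3, and the half-ranges as 60 for the error and 20 for the change of error.

```python
import math

def tune_operation_points(norm_points, r, half_range):
    """Warp normalized points in [-1, 1] by a signed power law, then rescale.

    r < 1 pushes the points outward (expanded central sets, slower response);
    r > 1 pulls them toward zero (compressed central sets, faster response).
    """
    warped = [math.copysign(abs(v) ** r, v) for v in norm_points]
    return [half_range * w for w in warped]   # denormalization step

# Assumed normalized operation points and ranges (inferred, see lead-in).
base = [-2 / 3, -1 / 3, 0.0, 1 / 3, 2 / 3]
error_points = tune_operation_points(base, 1 / 40, 60.0)
derror_points = tune_operation_points(base, 1 / 40, 20.0)
print([round(v, 4) for v in error_points])
print([round(v, 4) for v in derror_points])
```

Under these assumptions the sketch reproduces the operation points reported for k = 0 to within about 1e-3, and with r = 1 it returns the points unchanged, which matches the statement that k = 0.5 is equivalent to the initial conditions.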



5 Results of the Simulation 

Three values of k are analyzed, k = 0, 0.5 and 1, showing the effect
on the membership functions of the fuzzy variables, on the control surface and on the
simulated response. In all the analyzed cases the input is a step
function with amplitude 40. The parameters used to evaluate the quality of the
tuning are the settling time (taken at 98% of the steady-state value of the
response), the overshoot and the oscillations. In all the analyzed cases the
controller's structure is fixed, that is, the fuzzy associative memory is the same in
all the examples. The simulations include (dotted line) the response of the
system without a controller, for comparison with the response under the different tunings of the
controller. The controller's fuzzy variables and the membership functions for the
initial conditions are shown in figures 5, 6 and 7: error, change of error and
control action, respectively.
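The settling-time criterion can be made concrete with a small helper that works on sampled data. This is a sketch under one reading of the 98% criterion, namely the first instant after which the response stays within 2% of its steady-state value; the test signal is an illustrative first-order response, not the paper's plant.

```python
import math

def settling_time(times, values, final_value, band=0.02):
    """Earliest sample time after which the response stays within
    +/- band * |final_value| of final_value (None if it never settles)."""
    tol = band * abs(final_value)
    settled_at = None
    for t, y in zip(times, values):
        if abs(y - final_value) <= tol:
            if settled_at is None:
                settled_at = t      # candidate entry into the band
        else:
            settled_at = None       # left the band again, start over
    return settled_at

# Illustrative response 40 * (1 - exp(-t)) to a step of amplitude 40.
ts = [i * 0.001 for i in range(10001)]
ys = [40.0 * (1.0 - math.exp(-t)) for t in ts]
print(settling_time(ts, ys, 40.0))
```

Resetting the candidate whenever the signal leaves the band matters for responses with overshoot or oscillation, where the first crossing of the band is not the settling time.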



5.1 Case 1: Adjusting the Membership Functions with a Tuning Factor k = 0 

The function r(k) takes the value r(0) = 1/40. With the tuning process, the vectors of
operation points for the fuzzy input variables are the following:

Vop error = [-59.3947, -58.3743, 0, 58.3743, 59.3947]

Vop d/dt(error) = [-19.7982, -19.4581, 0, 19.4581, 19.7982]

Running the simulation with the controller's characteristics shown in figure 10,
the following response was obtained:




Fig. 10. Membership functions of the fuzzy variables and control surface for k = 0
Fig. 11. Response of the system for case 1 with k = 0



Figure 11 shows that with the tuning factor k = 0 the controller's effect
on the response of the system is small, and the response approaches the one obtained
without a controller. In this case the settling time is the largest that can be obtained,
$t_s$ = 4.96 s.



5.2 Case 2: Adjusting the Membership Functions with a Tuning Factor k = 0.5 

The function r(k) takes the value r(0.5) = 1. Running the simulation with the
controller's characteristics shown in figure 12, the following response was obtained:




Fig. 12. Membership functions of the fuzzy variables and control surface for k = 0.5
Fig. 13. Response of the system for case 2 with k = 0.5



This case, with the tuning factor k = 0.5, is equivalent to operating with the initial
conditions of the membership functions. The settling time is $t_s$ = 3.36 s.



5.3 Case 3: Adjusting the Membership Functions with a Tuning Factor k = 1

The function r(k) takes the value r(1) = 3. Running the simulation with the
controller's characteristics shown in figure 14, the response in figure
15 was obtained, where the settling time is $t_s$ = 1.6 s, the smallest value that can be obtained.




Fig. 14. Membership functions of the fuzzy variables and control surface for k = 1
Fig. 15. Response of the system for case 3 with k = 1



The controller's effect on the response of the system begins to cause a small
overshoot, due to the greater compression of the membership functions. If the value of
r(k) is increased further, the settling time is not reduced; it only causes larger overshoots
and oscillations around the reference.

To visualize the effect of different values of the tuning factor k on the settling
time of the system response, simulations were computed with increments of 0.05 in k over the
interval [0, 1]. The result is shown in figure 16.





Fig. 16. Settling time versus tuning factor k
Fig. 17. Comparative graph of the responses of the system for different values of k

From the simulations and the graphs in figures 16 and 17, it can be
seen that the optimal value of k for the tuning is k = 0.9. This value gives a settling
time $t_s$ = 1.6 s without a large overshoot and without oscillations (figure 18).

Additionally, the fuzzy controller's performance was compared with that of a
PID (proportional-integral-derivative) controller whose parameters are
$K_p = 25$, $T_i = 1.35$ and $T_d = 5$; the differences are minimal in terms of
settling time and general behavior (figure 19).

Fig. 18. Response of the system with k = 0.9
Fig. 19. Comparative graph of the fuzzy controller response versus the PID response



The disadvantage found in the PID controller is its inefficiency in comparison with
the fuzzy controller, since the control action generated by the PID can take very large
values that are impossible to apply in a real implementation. The
fuzzy controller, on the other hand, uses a realistic range of values.

Considering that this is the fastest response that can be obtained with the PID
controller when it is limited by the nature of the system, that is, when the range of values
that the control action can take is limited to the same values considered in the fuzzy
controller's definition, the tuning made on the fuzzy controller can be considered
satisfactory, since it allows the response time to be varied with very good behavior over the
whole range of the tuning factor. Note that finding the three parameters of
the PID for the optimal tuning is not straightforward, whereas for the fuzzy controller it is only necessary
to increase or decrease the parameter k depending on the desired settling time.



6 Conclusions 

Tuning methods for fuzzy controllers involve handling a large number of
variables, which makes the search for good structures and parameters very difficult
and often unsatisfactory. The proposed method uses only one variable and
operates on the transfer characteristic, or in this case the control surface, which
is the property of the fuzzy controller that defines its behavior, allowing the system
to respond with greater or lesser speed and precision.

The function r(k) can be generalized to any system that uses a fuzzy controller
by varying the values r(0) and r(1), as well as the coefficients of the function r(k),
depending on the desired behavior.

Another perspective is to create a self-tuning algorithm that modifies the
factor k by itself to find the desired response. In this respect, the use of fuzzy controllers presents
attractive aspects for implementation in real systems.

Abstract. The paper presents an intelligent predictive control to govern the
dynamics of a solar power plant system. This system is a highly nonlinear 
process; therefore, a nonlinear predictive method, e.g., neuro-fuzzy predictive 
control, can be a better match to govern the system dynamics. In our proposed 
method, a neuro-fuzzy model identifies the future behavior of the system over a 
certain prediction horizon while an optimizer algorithm based on EP determines 
the input sequence. The first value of this sequence is applied to the plant. 
Using the proposed intelligent predictive controller, the performance of outlet 
temperature tracking problem in a solar power plant is investigated. Simulation 
results demonstrate the effectiveness and superiority of the proposed approach. 



1 Introduction 

Model based predictive control (MBPC) [1,2] is now widely used in industry, and a
large number of implementation algorithms exist, owing to its ability to handle difficult
control problems which involve multivariable process interactions, constraints in the
system variables, time delays, etc. Although industrial processes, especially
continuous and batch processes in chemical and petrochemical plants, usually contain
complex nonlinearities, most MPC algorithms are based on a linear model of
the process, and such predictive control algorithms may not give satisfactory
control performance [3,4]. If the process is highly nonlinear and subject to large,
frequent disturbances, a nonlinear model is necessary to describe the behavior of
the process. In recent years, the use of neuro-fuzzy models for nonlinear system
identification has proved to be extremely successful [5-9]. The aim of this paper is to
develop a nonlinear control technique that provides high-quality control in the presence
of nonlinearities, as well as a better understanding of the design process when using
these emerging technologies, i.e., a neuro-fuzzy control algorithm. In this paper, we
use an Evolutionary Programming (EP) algorithm [10,11] to minimize the cost
function and obtain the control input. The paper analyzes a neuro-fuzzy based
nonlinear predictive controller for a solar power plant, which is a highly nonlinear
process [12]. The procedure is based on the construction of a neuro-fuzzy model of the
process and its proper use in the optimization process. Using the proposed
intelligent predictive controller, the performance of the outlet temperature tracking
problem in a solar power plant is investigated. Some simulations are provided to
demonstrate the effectiveness of the proposed control action.

Fig. 1. Distributed solar collector field schematic.



2 The Solar Thermal Power Plant 

The schematic diagram of the solar power plant used in this work is depicted in Fig. 1
[13-15]. Every solar collector has a linear parabolic-shaped reflector that focuses the
sun's beam radiation on a linear absorber tube located at the focus of the parabola.
Each of the loops is 142 m long, while the reflective area of the mirrors is around 264 m2.
The heat transfer fluid used to transport the thermal energy is Santotherm 55,
a synthetic oil with a maximum film temperature of 318°C and an autoignition
temperature of 357°C. The thermal oil is heated as it circulates through the absorber
tube before entering the top of the storage tank. A three-way valve located at the field
outlet enables oil recycling (bypassing the storage tank) until its outlet temperature
is high enough for the oil to be sent to the storage tank. The thermal energy stored in the tank
can subsequently be used to produce electrical energy in a conventional steam
turbine/generator or in the operation of the solar desalination plant. In this work the
input-output data available in [16] is used for the identification of the plant.



3 Neuro-Fuzzy Identification and Predictive Control of the Plant 

In the predictive control approach, the control system anticipates the plant response to a
sequence of control actions over a future time horizon [1,2]. The optimal
control action in this time horizon is the one that minimizes the difference
between the desired and predicted responses. MPC takes advantage of this prediction. It
was originally developed for linear models of the plant, which provide the prediction
formulation, and was later developed for limited classes of nonlinear systems. In
some cases, on-line estimation provides a parametric estimate of the nonlinear process
that can be used in an MPC methodology. A neuro-fuzzy system, as a universal
approximator, may be considered for the identification of nonlinear systems. This
nonlinear mapping is used for process output prediction over the future time horizon.

Fig. 2. Neuro-fuzzy predictive control scheme.

The structure of the intelligent adaptive predictive control is shown in Fig. 2.
The prediction system is formed by a Neuro-Fuzzy Identifier (NFI) that generates the
anticipated plant output for a future time window, $N_1 \le t \le N_2$. The fuzzy rules and
membership functions of this identifier can be trained off-line with the actual measured
data of the solar power plant system. The future control variable for this prediction stage
is determined by an optimization algorithm for the time interval $N_1 \le t \le N_u$, such
that $N_u \le N_2$, minimizing the following cost function:

$$J = \sum_{k=N_1}^{N_2} \left\| \hat{y}(t+k) - r(t+k) \right\|^2 + \sum_{k=N_1}^{N_u} \left\| \Delta u(t+k) \right\|^2 + \left\| y(t) - r(t) \right\|^2 \qquad (1)$$

where $\hat{y}(t+k)$ is the predicted plant output vector, determined by the NFI for the time
horizon $N_1 \le k \le N_2$, $r(t+k)$ is the desired set-point vector, $\Delta u(t+k)$ is the predicted
input variation vector in the time range $N_1 \le k \le N_u$, and $y(t)$ and $r(t)$ are the present plant
output and set-point vectors, respectively. The optimization block finds the sequence
of inputs that minimizes the cost function (1) over the future horizon, but only the first value of
this sequence is applied to the plant. This predictive control system is not model-based
in the usual sense, since it does not use a mathematical model of the plant. Therefore, the
optimization cannot be implemented by the conventional methods of MPC. A search
engine based on EP is used to determine the optimized control variable for the finite
future horizon. The competition search is performed on initial, randomly chosen
vectors of input deviations in a population and on their mutated vectors. Mutation and
competition continue until a desirable cost value is achieved.
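The cost described above has three parts: the squared tracking error over the prediction horizon, the squared input increments over the control horizon, and the present tracking error. A minimal sketch follows, assuming plain (unweighted) Euclidean norms; any weighting matrices in the original formula are not recoverable from the text, and the function names are illustrative.

```python
def sq_norm(v):
    """Squared Euclidean norm of a vector given as a list of floats."""
    return sum(x * x for x in v)

def predictive_cost(y_pred, r_future, du_future, y_now, r_now):
    """Unweighted predictive-control cost.

    y_pred, r_future: predicted outputs and set-points, one vector per step
                      of the prediction horizon.
    du_future:        input increments, one vector per step of the
                      (shorter) control horizon.
    y_now, r_now:     present plant output and set-point vectors.
    """
    tracking = sum(
        sq_norm([y - r for y, r in zip(yk, rk)])
        for yk, rk in zip(y_pred, r_future)
    )
    effort = sum(sq_norm(du) for du in du_future)
    present = sq_norm([y - r for y, r in zip(y_now, r_now)])
    return tracking + effort + present

# Toy single-output example: two prediction steps, one control step.
cost = predictive_cost([[1.0], [2.0]], [[1.0], [1.0]], [[0.5]], [0.0], [1.0])
print(cost)
```

In a receding-horizon loop this function would be evaluated once per candidate input sequence, with `y_pred` supplied by the identifier, and only the first increment of the best sequence would be applied before the window shifts.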

We use the neuro-fuzzy network proposed in [9], a four-layer network, for the
objective of predictive control. The first layer of the identifier network represents the
input stage, which provides the crisp inputs for the fuzzification step. The second layer
performs fuzzification. Its weights and biases represent, respectively, the widths and
means of the input membership functions. Using an exponential activation function, the
outputs of the second-layer neurons are the fuzzified system inputs. The weighting matrix
of the third-layer input represents the antecedent parts of the rules, and is called the Premise
Matrix. Each row of the premise matrix represents a fuzzy rule such as:

$R_{premise}$: IF $x_1$ is $T_{x_1}^{j}$ AND ... AND $x_m$ is $T_{x_m}^{j}$ THEN ...

where $T_{x_i}^{j}$ is the jth linguistic term of the ith input.

The neuron output is determined by the min composition, which provides the firing
strength of the rules. The fourth layer consists of separate sections for every system
output. Each section represents the consequent parts of the rules for one output, such as:

$R_{consequent}$: ... THEN ... $y_i$ is $T_{y_i}^{j}$ ...

where $T_{y_i}^{j}$ is the jth linguistic term of the ith output. Layer 5 forms the output
membership functions. The combination of the fifth and sixth layers provides the
defuzzification method. The weighting matrix of the fifth layer for each output section is
a diagonal matrix that contains the widths of the output membership functions. The
activation value of each neuron provides one summation term in the defuzzified output.
The linear activation function determines the output of each output section. The sixth
layer completes the defuzzification and provides the crisp output. The weighting vector of
each neuron contains the means of the output membership functions. The crisp output is
derived by using an activation function that implements the Center of Gravity
approximation. The identification process may not perform as desired if it does not include
the input/output interaction. For this purpose, the series-parallel configuration [17] is
chosen. This identification structure considers the past output states in conjunction
with the present inputs to determine the present output. The identifier with augmented
inputs is represented by

$$\hat{y}(k+1) = f\big(y(k), \ldots, y(k-i);\; u(k), \ldots, u(k-j)\big) \qquad (2)$$

such that $\hat{y}(k)$ is the estimated output at time step k, f is the identifier function, and u(k) and
y(k) are the plant input and output vectors, respectively, at time step k.

Adapting the neuro-fuzzy identifier to the solar power plant system is essential to
obtain an identifier that truly models the system. Training algorithms enable the
identifier to configure fuzzy rules and adjust membership functions so as to model the solar
power plant system with a certain error penalty. Training of the neuro-fuzzy identifier
takes place in two phases: configuration and tuning. The configuration phase
determines the fuzzy rules automatically, based on available data from the system
operation. For this purpose Genetic Algorithm (GA) training is chosen [18], because
of the specific structure of the neuro-fuzzy identifier. The fuzzy membership functions are
adjusted during tuning to reduce the modeling error. The error backpropagation method is
used for this tuning.

At the start of training, the identifier is initialized with default input/output
membership functions and fuzzy rules. The positions of the '1's in the premise and
consequence weighting matrices of the third and fourth layers define the fuzzy rules.
These matrices are encoded in the form of a GA chromosome. Recall that the fourth
layer consists of several sub-sections because of the multiple outputs. Therefore, a GA
chromosome has a compound structure, with one section per system
output. In this work, a GA with a non-binary alphabet [19] is the training method.
The alphabet size of each section is equal to the number of membership functions of the
output. The GA acts separately on these sub-chromosomes to find the best fit.
Given a set of experimental input/output plant data points, the GA can be applied to find
an optimal set of fuzzy rules.

The fitness function, based on the least squares principle, evaluates the
individuals of the population. To complete a GA iteration, it is necessary to prepare the
next generation of the population by applying three GA operators: selection, crossover
and mutation. The weighted roulette wheel is used as the selection operator; it assigns a
weighted slot to each individual [20]. The crossover operator generates two offspring
strings from each pair of parent strings, chosen with the crossover probability. Crossover
takes place in every sub-chromosome of the parents. Crossover points are determined
randomly with uniform distribution. The mutation operator changes the value of a gene
position with a frequency equal to the mutation rate. The new value of a chosen gene
is determined randomly with uniform distribution. Tuning the parameters of the
fuzzy membership functions completes the training of the neuro-fuzzy system. Adjusting
the membership functions increases the accuracy of the identifier, since the initial
membership functions were chosen at the beginning of the training. Error
backpropagation is used for training the self-organized NF. More details about this method of
identification can be found in [9]. The learning rates should be chosen appropriately.
A small learning rate gives slow convergence, while stability may
not be achieved with a large learning rate. Training ends after a specified
error is achieved or the maximum iteration number is reached. The means and widths of the input/output
membership functions are updated with the final values of the weighting matrices and bias
vectors. The proposed neuro-fuzzy identification is performed on the input-output data [16]
and is used for the objective of predictive control.



4 Control Input Optimization 



The intelligent predictive control system does not depend on a mathematical model
of the plant. Therefore, the optimization cannot be implemented by the conventional
methods of MPC. A search engine based on evolutionary programming (EP) is used
to determine the optimized control variables for a finite future time interval. The EP
performs a competition search in a population and its mutated offspring. The
members of each population are the input deviation vectors, which are initialized
randomly. Mutation and competition continue producing new generations to
minimize the value of a cost function. The output of the optimizer block is the control
valve deviations, which are integrated and applied to the identifier and the solar power plant
unit. The EP population consists of individuals that represent the deviations of the control
inputs. This population is represented by the following set

$$U_n = \{\Delta U_{1,n}, \Delta U_{2,n}, \ldots, \Delta U_{n_p,n}\} \qquad (3)$$

such that $U_n$ is the nth generation of the population, and $n_p$ is the population size. The
ith individual is written as

$$\Delta U_{i,n} = [\Delta u_1^{i,n}, \ldots, \Delta u_m^{i,n}], \quad \text{for } i = 1, 2, \ldots, n_p \qquad (4)$$

where m is the number of inputs. $\Delta u_j^{i,n}$ is the jth vector of the ith individual in the nth generation, as in
the following

$$\Delta u_j^{i,n} = [\Delta u_j^{i,n}(1), \ldots, \Delta u_j^{i,n}(n_u)]^T \qquad (5)$$

such that $n_u$ is the number of steps in the discrete-time horizon for the power unit
input estimation, defined by

$$n_u = N_u - N_1 \qquad (6)$$

where $N_1$ is the start time of the prediction horizon and $N_u$ is the end time of the input
prediction. The elements of the input deviation vector belong to a limited range of real
numbers

$$\Delta u_j^{i,n}(\cdot) \in [\Delta u_{j,\min}, \Delta u_{j,\max}] \qquad (7)$$

At the beginning of the EP algorithm, the population is initialized with randomly chosen
individuals. Each initial individual is selected with uniform distribution from the
above range of the corresponding input.

The EP with adaptive mutation scale has shown good performance in locating the
global minimum. Therefore, this method is used as it is formulated in [11]. The fitness
value of each individual is determined with a cost function that considers the error of the
predicted input and output in the prediction time window. The cost function of the ith
individual in the population is defined by

$$f_{i,n} = \sum_{k=1}^{n_y} \left\| r(t+k) - \hat{y}_{i,n}(t+k) \right\|^2 + \sum_{k=1}^{n_u} \left\| \Delta U_{i,n}(k) \right\|^2 + \left\| r(t) - y(t) \right\|^2 \qquad (8)$$

where $r(t+k)$ is the desired reference set-point at sample time $t+k$,
and $\hat{y}_{i,n}(t+k)$ is the discrete predicted plant output vector, which is determined by
applying $\Delta U_{i,n}(k)$ to the locally-linear fuzzy identifier for the time horizon
$n_y = N_2 - N_1$.

$\Delta U_{i,n}(k)$ in (8) is the kth element of the ith individual in the nth generation. The input
deviation vectors are determined in a smaller time window $n_u$, as in (6), such that
$n_u < n_y$. The inputs of the identifier stay constant after $t + n_u$.

The maximum, minimum, sum and average of the individual fitness values in the nth
generation are calculated for further statistical processing by

$$f_{\max}|_n = \{ f_{i,n} \mid f_{i,n} \ge f_{j,n} \ \forall f_{j,n},\ j = 1, \ldots, n_p \} \qquad (9)$$

$$f_{\min}|_n = \{ f_{i,n} \mid f_{i,n} \le f_{j,n} \ \forall f_{j,n},\ j = 1, \ldots, n_p \} \qquad (10)$$

$$f_{sum}|_n = \sum_{i=1}^{n_p} f_{i,n} \qquad (11)$$

$$f_{avg}|_n = \frac{f_{sum}|_n}{n_p} \qquad (12)$$

After determining the fitness values of a population, the mutation operator
acts on the individuals to make a new offspring population. In mutation, each
element of the parent individual as in (5) provides a new element by adding a random
number such as

$$\Delta u_j^{i+n_p,n}(k) = \Delta u_j^{i,n}(k) + N\big(\mu, \sigma_{i,j}^2(n)\big) \qquad (13)$$

for $i = 1, 2, \ldots, n_p$, $j = 1, 2, \ldots, m$, $k = 1, 2, \ldots, n_u$,

such that $N(\mu, \sigma_{i,j}^2(n))$ is a Gaussian random variable with mean $\mu = 0$ and variance
$\sigma_{i,j}^2(n)$. The variance of the random variable in (13) is set through

$$\sigma_{i,j}(n) = \beta(n)\,(\Delta u_{j,\max} - \Delta u_{j,\min})\,\frac{f_{i,n}}{f_{\max}|_n} \qquad (14)$$

where $\beta(n)$ is the mutation scale of the population, such that $0 < \beta(n) < 1$. After mutation,
the fitness of the offspring individuals is evaluated and assigned to them.
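The mutation step can be sketched as follows for one input channel of one individual. Two points are assumptions of this sketch rather than statements of the paper: the scale computed from the mutation scale, the admissible range and the relative fitness is used as the standard deviation of the Gaussian noise, and offspring are clipped back into the admissible range. All names (`mutate`, `beta`, `du_min`, `du_max`) are illustrative.

```python
import random

def mutate(parent, fitness, f_max, beta, du_min, du_max, rng=random):
    """Gaussian mutation of one individual (increments for a single input).

    The noise scale grows with the individual's cost relative to the worst
    cost f_max, so poor individuals are perturbed more than good ones.
    """
    sigma = beta * (du_max - du_min) * fitness / f_max
    child = [du + rng.gauss(0.0, sigma) for du in parent]
    # Assumption: keep offspring inside the admissible range of increments.
    return [min(max(c, du_min), du_max) for c in child]

rng = random.Random(0)                      # seeded for repeatability
parent = [0.0, 0.1, -0.1, 0.05, 0.0]
child = mutate(parent, fitness=1.0, f_max=2.0, beta=0.5,
               du_min=-1.0, du_max=1.0, rng=rng)
print(child)
```

Scaling the noise by relative fitness concentrates the search: individuals close to the best solution are only fine-tuned, while weak ones explore more widely.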

The newly generated individuals and the old individuals form a combined
population of size $2n_p$. Each member of the combined population competes with
some other members to determine which ones survive to the next
generation. For this purpose, the ith individual $\Delta U_{i,n}$ competes with the jth
individual $\Delta U_{j,n}$, such that $j = 1, 2, \ldots, p$. The number of individuals to compete with
is a fixed number p. The p individuals are selected randomly with uniform
distribution. The result of this competition is a binary number $v_{ij,n} \in \{0, 1\}$ that
represents a loss or a win, and is determined by

$$v_{ij,n} = \begin{cases} 1 & \text{if } \lambda_{j,n} < \dfrac{f_{j,n}}{f_{j,n} + f_{i,n}} \\ 0 & \text{otherwise} \end{cases} \qquad (15)$$

such that $\lambda_{j,n} \in [0, 1]$ is a randomly selected number with uniform distribution, and
$f_{j,n}$ is the fitness of the jth selected individual. The value of $v_{ij,n}$ is set to 1 if,
according to (15), the fitness of the ith individual is relatively smaller than the fitness
of the jth individual. To select the surviving individuals, a weight value is assigned to
each individual by

$$w_{i,n} = \sum_{j=1}^{p} v_{ij,n}, \quad \text{for } i = 1, 2, \ldots, 2n_p \qquad (16)$$



The $n_p$ individuals with the highest competition weight $w_{i,n}$ are selected to form
the (n+1)th generation. This newly formed generation participates in the next
iteration. To determine the convergence of the process, the difference between the maximum
and minimum fitness of the population is checked against a desired small number $\varepsilon > 0$,
as in

$$f_{\max}|_n - f_{\min}|_n < \varepsilon \qquad (17)$$

If this convergence condition is met, the individual with the lowest fitness is
selected as the sequence of input vectors for the future time horizon. The first vector is
applied to the plant and the time window shifts to the next prediction step.
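The competition and survivor selection described above can be sketched as below. It assumes non-negative cost values (true for a sum of squared norms) and treats a tie of two zero costs as a coin flip, which is a choice of this sketch; the function name and arguments are illustrative.

```python
import random

def select_survivors(population, fitness, n_keep, p, rng=random):
    """Stochastic tournament: each individual meets p random opponents and
    wins a round with probability f_j / (f_j + f_i), so a lower cost f_i
    means more wins. The n_keep individuals with most wins survive."""
    weights = []
    for f_i in fitness:
        wins = 0
        for _ in range(p):
            j = rng.randrange(len(population))      # uniformly chosen opponent
            total = f_i + fitness[j]
            threshold = 0.5 if total == 0 else fitness[j] / total
            if rng.random() < threshold:
                wins += 1
        weights.append(wins)
    ranked = sorted(range(len(population)),
                    key=lambda i: weights[i], reverse=True)
    return [population[i] for i in ranked[:n_keep]]

# Toy combined population of four labelled individuals; "a" has the lowest cost.
survivors = select_survivors(["a", "b", "c", "d"],
                             [0.0, 1e6, 1e6, 1e6],
                             n_keep=1, p=200, rng=random.Random(1))
print(survivors)
```

Because the outcome of each round is random, a slightly worse individual can still survive, which keeps some diversity in the population compared with a plain sort by cost.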




Predictive Control of a Solar Power Plant with Neuro-Fuzzy Identification 







Fig. 3. Simulation results of set point, outlet oil temperature and oil flow rate. 
Fig. 4. Simulation results of solar radiation and inlet oil temperature.



Before starting the new iteration, the mutation scale changes according to the
newly formed population. If the mutation scale is kept as a small fixed number, EP
may converge prematurely. On the other hand, a large fixed mutation scale raises the
possibility of a non-converging process. An adaptive mutation scale
changes the mutation probability according to the minimum fitness value of the
individuals in the (n+1)th generation. The mutation scale for the next generation is
determined by



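$$\beta(n+1) = \begin{cases} \beta(n) - \beta_{step} & \text{if } f_{\min}|_n = f_{\min}|_{n+1} \\ \beta(n) & \text{if } f_{\min}|_n > f_{\min}|_{n+1} \end{cases} \qquad (18)$$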



where n is the generation number and $\beta_{step}$ is the predefined possible step change of the
mutation scale in each iteration.
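The adaptive mutation scale can be sketched in a few lines. The exact conditions of (18) are partly illegible in this copy, so the rule below is an assumption based on the surrounding description: the scale is reduced by a fixed step when the minimum fitness did not improve, and kept otherwise. The lower bound `beta_floor` is an addition of this sketch to keep the scale non-negative.

```python
def next_mutation_scale(beta, f_min_prev, f_min_new, beta_step, beta_floor=0.0):
    """Mutation scale for the next generation.

    Keep beta while the best (minimum) cost is still improving; otherwise
    shrink it by beta_step so the search narrows around the current best.
    """
    if f_min_new < f_min_prev:
        return beta                              # progress made: keep exploring
    return max(beta - beta_step, beta_floor)     # stalled: refine the search

print(next_mutation_scale(0.5, 1.0, 0.8, 0.1))   # improving generation
print(next_mutation_scale(0.5, 1.0, 1.0, 0.1))   # stalled generation
```

This addresses the two failure modes mentioned in the text: a scale that starts large enough avoids premature convergence, and the gradual reduction prevents the search from wandering indefinitely.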



5 Simulation Results

The proposed intelligent predictive control is tested on a real solar power plant with
input-output data available in [16]. In order to maintain (or drive) the outlet oil
temperature at the pre-specified level despite variations in the sun's beam radiation
and in the inlet oil temperature, the control system manipulates the thermal oil flow rate
pumped to the solar collector field. After the initial training, the identifier is engaged in
the closed loop of the predictive control as in Fig. 2. The prediction
horizon parameters are selected to be $n_y = 50$ ($n_y = N_2 - N_1$) and $n_u = 20$, with a time step of
$\Delta t = 6$ s. The population size is chosen to be $n_p = 15$. The crossover and mutation rates
are chosen to be 0.75 and 0.002, respectively, in GA training.

Figs. 3 and 4 show the outlet oil temperature, the set point and the oil flow rate
(V), and the inlet oil temperature and the solar radiation (I), respectively. As
can be seen, the proposed intelligent predictive control provides a very interesting
dynamic response of the outlet oil temperature, the control system being quite stable
at all the operating points.



6 Conclusion 

In this paper, an intelligent predictive control was applied to a solar power plant system.
This system is a highly nonlinear process; therefore, a nonlinear predictive method,
e.g., neuro-fuzzy predictive control, can be a better match to govern the system
dynamics. In our proposed method, a neuro-fuzzy model identified the future
behavior of the system over a certain prediction horizon. An optimizer algorithm
based on EP used the identifier-predicted outputs and determined the input sequence in a
time window. Using the proposed neuro-fuzzy predictive controller, the performance
of the outlet temperature tracking problem in a solar power plant was investigated.
Simulation results demonstrated the effectiveness and superiority of the proposed
approach.



Abstract. Since liquid tank systems are commonly used in industrial
applications, system-related requirements result in many modeling and control
problems because of their interactive use with other process control elements.
The modeling stage is one of the most noteworthy parts in the design of a control
system. Although nonlinear tank problems have been widely addressed in
classical system dynamics, when designing intelligent control systems, the
corresponding model for simulation should reflect the whole characteristics of
the real system to be controlled. In this study, a coupled, interacting, nonlinear
liquid leveling tank system is modeled using ANFIS (Adaptive-Network-Based
Fuzzy Inference System), which will be further used to design and apply a
fuzzy-PID control to this system. First, a mathematical model of the system
is established; then, data gathered from this model is employed to create an
ANFIS model of the system. The mathematical and ANFIS models are
compared, model consistencies are discussed, and the flexibility of ANFIS
modeling is shown.



1 Introduction 

Artificial Neural Networks (ANNs) and Fuzzy Logic (FL) have been increasingly in
use in many engineering fields since their introduction as mathematical aids by
McCulloch and Pitts, 1943, and Zadeh, 1965, respectively. Being branches of
Artificial Intelligence (AI), both emulate the human way of using past experiences,
adapting accordingly and generalizing. While the former have the capability of
learning by means of parallel connected units, called neurons, which process inputs in
accordance with their adaptable weights, usually in a recursive manner for
approximation, the latter can handle imperfect information through linguistic
variables, which are arguments of their corresponding membership functions.

Although the fundamentals of ANNs and FL go back as early as the 1940s and 1960s,
respectively, significant advancements in applications took place around the 1980s. After
the introduction of the back-propagation algorithm for training multi-layer networks by
Rumelhart and McClelland, 1986, ANNs have found many applications in numerous
inter-disciplinary areas [1-3]. On the other hand, FL made a great advance in the mid
1970s with some successful results of laboratory experiments by Mamdani and
Assilian [4]. In 1985, Takagi and Sugeno [5] contributed to FL with a new rule-based
modeling technique.

Operating with linguistic expressions, fuzzy logic can use the experiences of a
human expert and also compensate for inadequate and uncertain knowledge about the
system. On the other hand, ANNs have proven superior learning and generalizing
capabilities even on completely unknown systems that can only be described by their
input-output characteristics. By combining these features, more versatile and robust
models, called "neuro-fuzzy" architectures, have been developed [6-7].

In a control system, a plant displaying nonlinearities has to be described
accurately in order to design an effective controller. In obtaining the model, the
designer has to follow one of two ways. The first one is using the knowledge of
physics, chemistry, biology and the other sciences to describe an equation of motion
with Newton's laws, or electric circuits and motors with Ohm's, Kirchhoff's or
Lenz's laws, depending on the plant of interest. This is generally referred to as
mathematical modeling. The second way requires experimental data obtained by
exciting the plant and measuring its response. This is called system identification and
is preferred in cases where the plant or process involves extremely complex
physical phenomena or exhibits strong nonlinearities.

Obtaining a mathematical model for a system can be rather complex and time
consuming, as it often requires assumptions such as defining an operating point,
linearizing about that point and ignoring some system parameters.
This fact has recently led researchers to exploit neural and fuzzy techniques in
modeling complex systems utilizing solely the input-output data sets. Although fuzzy
logic allows one to model a system using human knowledge and experience with if-then
rules, it is not always adequate on its own. This is also true for ANNs, which
only deal with numbers rather than linguistic expressions. This deficiency can be
overcome by combining the superior features of the two methods, as is done in the
ANFIS architecture introduced by Jang, 1993 [8-11].

In the literature, ANFIS applications are generally encountered in the areas of 
function approximation, fault detection, medical diagnosis and control, [12-17]. In 
this study, ANFIS architecture was used to model the dynamic system, which is taken 
as a black-box, i.e. described by its observed responses to the introduced inputs. 



2 Fuzzy Modeling and ANFIS Architecture 

In fuzzy system identification, the system parameters should be determined first. In this
study, the ANFIS architecture based on Takagi-Sugeno fuzzy modeling is employed to
model the coupled nonlinear liquid leveling tank system. With a hybrid
learning procedure, ANFIS can learn an input-output mapping by combining the
complementary features of neural networks and fuzzy logic. The regression vector
is chosen according to the NARX model as follows.

$$\varphi = [u(t-k),\ u(t-n),\ y(t-k),\ y(t-m)] \qquad (1)$$

For simplicity, the fuzzy inference system is considered to have two inputs (x,
y) and one output (z) (MISO). A first-order Sugeno model [5] can be expressed with
two rules as follows; the related inference method is shown in Fig. 1.

Rule 1: IF x is $A_1$ and y is $B_1$ THEN $f_1 = p_1 x + q_1 y + r_1$
Rule 2: IF x is $A_2$ and y is $B_2$ THEN $f_2 = p_2 x + q_2 y + r_2$




Fig. 1. The inference method of Sugeno model



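Using the rule outputs $f_1$ and $f_2$, the output function of this Sugeno model
is expressed as

$$f = \frac{w_1 f_1 + w_2 f_2}{w_1 + w_2} = \bar{w}_1 f_1 + \bar{w}_2 f_2 \qquad (2)$$

where $w_i$ is the firing strength of rule i and $\bar{w}_i = w_i / (w_1 + w_2)$ is its normalized value.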



The corresponding ANFIS model for Sugeno's fuzzy structure is given in Fig. 2.



Fig. 2. The ANFIS model of Sugeno's fuzzy inference method

As seen from Fig. 2, ANFIS has 5 layers; the functions of these layers are
explained below:

Layer 1: In this layer, where the fuzzification process takes place, every node is
adaptive. The outputs of this layer form the membership values of the premise
part.

Layer 2: In contrast to Layer 1, the nodes in this layer are fixed. Each node output
represents the firing strength of a rule.

Layer 3: In this layer, where the normalization process is performed, the nodes are
fixed as they are in Layer 2. The ratio of the ith rule's firing strength to the
sum of all rules' firing strengths is calculated for the corresponding node.








Modeling of a Coupled Industrial Tank System with ANFIS 807 

Layer 4: Since the nodes in this layer operate as a function block whose variables
are the input values, they are adaptive. Consequently, the output of this
layer forms the TSK outputs, and this layer is referred to as the consequent
part.

Layer 5: This is the summation layer, which consists of a single fixed node. It sums
up all the incoming signals and produces the output.
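
The five layers described above can be illustrated with a minimal forward-pass sketch for a two-input, first-order Sugeno system. This is only an illustration under stated assumptions: the product is assumed as the rule T-norm in Layer 2 (the text does not name one), generalized bell membership functions are used as mentioned later in the simulation section, and the function names `gbell` and `anfis_forward` and all parameter values are hypothetical.

```python
def gbell(x, a, b, c):
    """Generalized bell membership function: 1 / (1 + |(x - c) / a|^(2b))."""
    return 1.0 / (1.0 + abs((x - c) / a) ** (2 * b))


def anfis_forward(x, y, premise, consequent):
    """Forward pass of a two-input, first-order Sugeno ANFIS.

    premise:    per rule, ((a, b, c) for the x membership, (a, b, c) for the y membership)
    consequent: per rule, the linear coefficients (p, q, r)
    Returns the crisp output and the normalized firing strengths.
    """
    # Layer 1: fuzzification (adaptive nodes holding the premise parameters)
    mu = [(gbell(x, *pa), gbell(y, *pb)) for pa, pb in premise]
    # Layer 2: firing strength of each rule (product T-norm, an assumption)
    w = [ma * mb for ma, mb in mu]
    # Layer 3: normalization of the firing strengths
    total = sum(w)
    w_bar = [wi / total for wi in w]
    # Layer 4: consequent part, normalized strength times the linear rule output
    outs = [wb * (p * x + q * y + r) for wb, (p, q, r) in zip(w_bar, consequent)]
    # Layer 5: summation of all incoming signals
    return sum(outs), w_bar
```

When every rule shares the same consequent coefficients, the output reduces to that single linear function, which is a quick sanity check on the normalization step.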



3 Modeling a Nonlinear Coupled-Tank System with ANFIS 

In this part, an ANFIS model of a nonlinear coupled tank system is obtained. Fig. 3
shows a simple double tank liquid-level system with a valve between the tanks [18]. Each tank
has an outlet port with its own flow rate. The first tank is fed with the flow rate $Q_{in}$. The
system is configured as a SISO system, with $Q_{in}$ as input and $h_2$ as output. By formulating
the mass input-output balance for each tank together with the Bernoulli equations, the following
nonlinear equations are obtained:



Fig. 3. A simple schematic for an interacting coupled-tank level control system.
(Diagram labels: automatic control valve to adjust the liquid levels of the tanks by
controlling the flap angle φ; valve to adjust the flow rate between tanks; discharge valves.)



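$$A_1 \frac{dh_1}{dt} = Q_{in} - a_1\sqrt{h_1} - a_3\sqrt{h_1 - h_2}$$

$$A_2 \frac{dh_2}{dt} = -a_2\sqrt{h_2} + a_3\sqrt{h_1 - h_2} \qquad (3)$$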



$Q_{in}$ is the volumetric flow rate, which can reach a maximum of 0.12 m³/s and is adjusted by
a valve. $A_1$ and $A_2$ are the base surface areas of the tanks, each of which is 1 m². $a_1$, $a_2$ and $a_3$
are proportionality constants whose values depend on the
discharge coefficients, the cross-sectional areas of the outlets and the gravitational constant.
The outlet areas of tanks I and II are 0.01 m² and the outlet area between the two tanks is 0.05 m².
$Q_{in}$ is the controlled input liquid flow rate, given in terms of its maximum value $Q_{in,\max}$ as

$$Q_{in} = Q_{in,\max}\,\sin(\varphi(t)), \qquad \varphi(t) \in [0, \pi/2] \qquad (4)$$
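
As a rough illustration of how such a model can be simulated outside SIMULINK, the sketch below integrates a coupled-tank mass balance with the explicit Euler method. It rests on several assumptions not stated in the text: the balance is taken as $A_1 \dot{h}_1 = Q_{in} - a_1\sqrt{h_1} - a_3\sqrt{h_1 - h_2}$ and $A_2 \dot{h}_2 = a_3\sqrt{h_1 - h_2} - a_2\sqrt{h_2}$, the constants are taken as $a_i = C_d \cdot \text{area}_i \cdot \sqrt{2g}$ with a hypothetical discharge coefficient $C_d = 0.6$, the inflow is $Q_{in} = 0.12 \sin\varphi$ m³/s, and a signed square root handles the case $h_1 < h_2$. The names `step` and `simulate` are illustrative only.

```python
import math

G = 9.81          # gravitational acceleration, m/s^2
CD = 0.6          # ASSUMED discharge coefficient (not given in the text)
A1 = A2 = 1.0     # tank base areas, m^2 (from the text)
Q_MAX = 0.12      # maximum inflow, m^3/s (from the text)
# ASSUMED form of the proportionality constants: CD * outlet area * sqrt(2 g)
a1 = a2 = CD * 0.01 * math.sqrt(2 * G)   # tank outlets, 0.01 m^2 each
a3 = CD * 0.05 * math.sqrt(2 * G)        # connection between tanks, 0.05 m^2


def signed_sqrt(v):
    """Square root that keeps the sign, so flow reverses if h2 exceeds h1."""
    return math.copysign(math.sqrt(abs(v)), v)


def step(h1, h2, phi, dt):
    """One explicit Euler step of the two mass balances for valve angle phi (rad)."""
    q_in = Q_MAX * math.sin(phi)
    q12 = a3 * signed_sqrt(h1 - h2)                      # flow from tank 1 to tank 2
    dh1 = (q_in - a1 * math.sqrt(max(h1, 0.0)) - q12) / A1
    dh2 = (q12 - a2 * math.sqrt(max(h2, 0.0))) / A2
    return max(h1 + dt * dh1, 0.0), max(h2 + dt * dh2, 0.0)


def simulate(phis, dt=0.5, h1=0.0, h2=0.0):
    """Return the list of (h1, h2) levels for a sequence of valve angles."""
    levels = []
    for phi in phis:
        h1, h2 = step(h1, h2, phi, dt)
        levels.append((h1, h2))
    return levels
```

For example, `simulate([math.pi / 2] * 2000)` fills both tanks from empty with the valve fully open; the input-output pairs produced this way play the role of the training data set.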



An excitation signal representing input valve angle values between 0 and π/2
radians is created by adding two sinusoidal signals. The output data set
is produced using the expressions above in MATLAB SIMULINK (Fig. 4). The data
set obtained is fed to ANFIS in order to approximate a model of the system to be
controlled.




Fig. 4. Training data for ANFIS from the double tank system illustrated (input data - dashed, 
output data - solid) 



4 Simulation Results 

This section presents the simulation results of the proposed ANFIS model for the tank
liquid-level system and its control. In order to obtain an input-output data set, an
input data set representing the changes of the control valve's flap angle, φ, in the
range of 0 (no flow in) to π/2 (full flow) radians for 2000 samples is produced
analytically, as plotted in Fig. 4. This input is applied to the system and the
corresponding liquid level outputs in meters are obtained.

In the modeling, the regression vector is chosen according to the NARX model, which involves past
input and output to approximate the system's input-output relationship, as shown below.

$$\varphi = [u(t-1),\ y(t-1)]$$

Here, the system is modeled as a MISO system having two inputs and one output.
Each input variable is represented by five membership functions, which gives 25
rules. The membership functions for this Takagi-Sugeno fuzzy model are chosen as
generalized bell functions. The input-output data is processed by ANFIS and hence
the proposed model is obtained. In Fig. 5, the outputs of the mathematical and
ANFIS models are plotted for comparison. The time axes of the simulation results are
plotted in terms of the discrete index number; the corresponding time scale in seconds is
selected automatically by MATLAB. Fig. 6 shows the difference between the
two models at each sample.
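
The data arrangement described above can be sketched as follows. The helper names `narx_pairs` and `grid_rules` are hypothetical, and the rule enumeration assumes a grid partition (one rule per combination of membership functions), which is consistent with five functions per input giving 25 rules.

```python
from itertools import product


def narx_pairs(u, y):
    """Build regressors [u(t-1), y(t-1)] and targets y(t) from two equal-length series."""
    regressors = [[u[t - 1], y[t - 1]] for t in range(1, len(y))]
    targets = [y[t] for t in range(1, len(y))]
    return regressors, targets


def grid_rules(n_mf_per_input):
    """Enumerate rule antecedents as index tuples, one per combination of membership functions."""
    return list(product(*(range(n) for n in n_mf_per_input)))
```

With two inputs and five membership functions each, `len(grid_rules([5, 5]))` is 25, and a series of 2000 samples yields 1999 regressor-target pairs.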

To evaluate the proposed ANFIS model, the nonlinear double
tank system is tested with sets of input data either similar to or different from the
training data, and the results are compared with the mathematical model of the original system. First, the
original training data, which is a combination of sinusoidal and step signals, is
applied to both the ANFIS and mathematical models and the responses of the systems are
gathered (Fig. 5). The comparison of the two in Fig. 6 shows that the general conformity of
the ANFIS model to the mathematical model is very good. Only around zero
input value does the system show its maximum discrepancy, which is around 8
cm. Second, the evaluation is continued with a unit step input; the
input and outputs of the models are illustrated in Fig. 7, which shows the conformance between the
mathematical and ANFIS models. Lastly, the ANFIS model is tested with
a random input, again a combination of more complex functions (Fig. 8).
Again, the general agreement of the two signals is good except around zero input value
(Figs. 9 and 10).




Fig. 5. Responses of the ANFIS and mathematical model of double tank system to the training
input data (mathematical model output - dashed, ANFIS model output - solid).

Fig. 6. Difference between mathematical and ANFIS model of double tank system for the
training input response.

Fig. 7. Mathematical and ANFIS model response of double tank system to a unit step input
(input step - dotted, math response - dashed, ANFIS response - solid)

Fig. 8. A non-uniform, random input for evaluation of double tank system ANFIS model



5 Conclusions 

It is generally not possible to derive an accurate model of a process or plant, especially
one with nonlinearities. If a reliable model is not available, it is quite difficult to design a
controller producing the desired outputs. On the other hand, traditional modeling
techniques are rather complex and time consuming. However, using an input-output data
set, ANFIS can approximate the system. When the data set does not represent the
whole operating range adequately, the model obtained will not be as robust.

Fig. 9. Double tank system ANFIS and mathematical model responses to a random input data
(math - dashed, ANFIS - solid)

Fig. 10. Difference between mathematical and ANFIS model of double tank system for random
input data.



During the modelling stage, ANFIS with a Sugeno model having two inputs and one output,
each input having five membership functions, is employed, and a nonlinear coupled-tank
liquid-level system is modelled successfully. Model evaluations are performed
with various forms of input data; the conformance between the mathematical and
ANFIS models is demonstrated, and the ability of ANFIS to model a nonlinear system is
shown.



Abstract. Artificial neural networks (ANN) have been used as predictive
systems for a variety of application domains such as science, engineering
and finance. Therefore it is very important to be able to estimate
the reliability of a given model. Bootstrap is a computer intensive method
used for estimating the distribution of a statistical estimator based on
an imitation of the probabilistic structure of the data generating process
and the information contained in a given set of random observations.
Bootstrap plans can be used for estimating the uncertainty associated
with a value predicted by a feedforward neural network.

The available bootstrap methods for ANN assume independent random
samples that are free of outliers. Unfortunately, the existence of outliers
in a sample has serious effects: some resamples may have a higher
contamination level than the initial sample, and the model is affected
because it is sensitive to these deviations, resulting in poor performance.
In this paper we investigate a robust bootstrap method for ANN that is
resistant to the presence of outliers and is computationally simple. We
illustrate our technique on synthetic and real datasets, and results are
shown on confidence intervals for neural network prediction.

Keywords: Feedforward Artificial Neural Networks, Bootstrap, Robust
theory, Confidence Interval.



1 Introduction 

Feedforward neural networks (FANN) with nonlinear transfer functions offer uni- 
versal approximation capabilities based wholly on the data itself, i.e., they are 
purely empirical models that can theoretically mimic the input-output relation- 
ship to any degree of precision. 

Our estimates of the characteristics of a population of interest rely
on the treatment of a sample or a set of prototypes that are realizations of an
unknown probability model. The uncertainty of these estimates
results from the fact that we are trying to generalize based on a limited set of
samples that are contaminated by the process of data acquisition. The question
is how to estimate the probabilistic behaviour of our estimates, which
are functionally dependent on the unknown probabilistic model of the data.

* This work was supported in part by Research Grant Fondecyt 1010101 and 7010101,
and in part by Research Grant DGIP-UTFSM.

The traditional way to treat this problem has been by imposing strong as- 
sumptions over the probabilistic laws on the data generating process or by taking 
the asymptotic approach (see [7] and [9]). 

Computationally intensive methods based on bootstrap techniques [3] have 
been used for evaluating the performance of ANN [6], but no care is taken under 
contaminated data. The existence of outliers in a sample is an obvious problem in 
inference which can become worse when the usual bootstrap is applied, because 
some resamples may have higher contamination levels than the initial samples 
[10]. Bootstrapping robust estimators has some drawbacks, namely, numerical 
stability and high computational cost. Usually, real data is contaminated by 
outliers, i.e., observations that are substantially different to the bulk of data, 
with different sources of variation yielding exceptions of different nature, so 
rejection of outliers is not an acceptable treatment. 

In this paper we propose a modification of the bootstrap procedure applied 
to neural networks that is resistant to the presence of outliers in the data and is 
computationally simple and feasible. The paper is organized as follows: in section 
2 we introduce the notation and architecture of feedforward neural networks 
(FANN); in section 3 we present the bootstrap and its application to neural 
networks; in section 4 we develop our robust version of the bootstrap procedure; 
finally in section 5 we present a simulation study on datasets and the conclusions 
are drawn in the last section. 



2 Feedforward Artificial Neural Networks 

A FANN consists of elementary processing elements called neurons, organized
in layers: the input, the hidden and the output layers. The links of the neurons
go from one layer to the next without any type of bridge, lateral or
feedback connections. For simplicity, a single-hidden-layer architecture with only one output
is considered in this paper, so the different classes of neural models
can be specified by the number of hidden neurons by
$S_\lambda = \{g_\lambda(x,w) \in \mathbb{R},\ x \in \mathbb{R}^m,\ w \in W\}$,
where $W \subseteq \mathbb{R}^d$ is the parameter space, assumed to be
convex, closed and bounded, m is the dimension of the input space, $g_\lambda(x,w)$ is
a non-linear function of x with $w = (w_1, w_2, \ldots, w_d)^T$ being its parameter vector,
$\lambda$ is the number of hidden neurons and $d = (m + 2)\lambda + 1$ is the number of
free parameters.

Given the sample of observations $\chi = \{x_k, y_k\}_{k=1}^{n}$, we suppose a noisy
measured output $y_k$, which is considered as the realization of the random variable
$Y = Y|x$ conditioned on $x_k$. We assume that there exists an unknown regression
function $\varphi(x) = E[Y|x]$ such that for any fixed value of x, the stochastic process
is determined by $Y|x = E[Y|x] + \varepsilon$, where $\varepsilon$ is a random variable with zero
expectation and variance $\sigma_\varepsilon^2$. The task of neural learning is to construct an
estimator $g_\lambda(x,w)$ of the unknown function $\varphi(x)$ by

$$\hat{y} = g_\lambda(x,w) = \gamma_2\left(\sum_{j=1}^{\lambda} w_j^{[2]}\, \gamma_1\left(\sum_{i=1}^{m} w_{ij}^{[1]} x_i + w_{m+1,j}^{[1]}\right) + w_{\lambda+1}^{[2]}\right) \qquad (1)$$



where w is the parameter vector to be estimated and $\lambda$ is a control parameter (the number
of hidden units). An important factor in the specification of neural models is
the choice of the activation functions $\gamma_j$, which can be any non-linear functions as
long as they are continuous, bounded and differentiable. The activation function
of the hidden neurons, $\gamma_1$, is typically the logistic function $\gamma_1(z) = [1 + \exp(-z)]^{-1}$.
For the output neuron, the function $\gamma_2$ could be a linear function, $f(z) = z$, or a
nonlinear function.
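
The architecture described in this section can be sketched in a few lines. This is a minimal illustration, assuming one bias per hidden neuron and one output bias (which is what makes the parameter count equal to $(m + 2)\lambda + 1$); the names `fann_forward` and `n_params` and the argument layout are hypothetical.

```python
import math


def logistic(z):
    """Logistic activation for the hidden neurons: 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def n_params(m, lam):
    """Free parameters of a single-hidden-layer, single-output FANN.

    m * lam input weights + lam hidden biases + lam output weights + 1 output bias.
    """
    return (m + 2) * lam + 1


def fann_forward(x, w_hidden, b_hidden, w_out, b_out, out_act=lambda z: z):
    """Network output for one input vector x.

    w_hidden: lam weight vectors of length m, b_hidden: lam biases,
    w_out: lam output weights, b_out: output bias, out_act: output activation
    (identity by default, i.e. a linear output neuron).
    """
    hidden = [
        logistic(sum(wi * xi for wi, xi in zip(w_j, x)) + b_j)
        for w_j, b_j in zip(w_hidden, b_hidden)
    ]
    return out_act(sum(v * h for v, h in zip(w_out, hidden)) + b_out)
```

For instance, a network with one input and ten hidden neurons has `n_params(1, 10) == 31` free parameters.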



3 Bootstrap and Bootstrapping Neural Networks

The Bootstrap of Efron [3] is a method for estimating the distribution of an
estimator or test statistic by resampling the data or a model estimated from
the data. Bootstrap is a well tested tool in many areas of parametric
and nonparametric statistics and a field of intensive research over the last two
decades.

In the context of regression models, two types of bootstrap have become established:
the residual and the pairwise bootstrap. Both approaches attempt to respect
the dependence between predictors and targets. The approach named residual
bootstrapping applies a bootstrap plan to the residuals of the model; this plan is
usually nonparametric but can be smoothed or parameterized if we have enough
information. To obtain the set of residuals $\{\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_n\}$ we fit the model
$g(x)$ to the data and obtain the set of parameters for the model, where the
entities under study can be these parameters or a function of these parameters.
Applying the bootstrap plan to the set of residuals, we obtain a bootstrap
residual set, say $\{\varepsilon_1^*, \varepsilon_2^*, \ldots, \varepsilon_n^*\}$. To generate a bootstrap data set we follow the
recursive scheme $y_k^* = g(x_k) + \varepsilon_k^*$. We repeat the procedure several times and
for each bootstrap data set we fit the model again. In the case of the pairwise
bootstrap, the idea is to resample directly the training pairs $\{x_k, y_k\}_{k=1}^{n}$, taking
both predictors and targets simultaneously. The bootstrap samples obtained with
both methods are used to fit the models and generate an estimate of the
distribution of the quantity of interest.
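
The two resampling schemes can be contrasted with a small sketch. To keep it self-contained, a least-squares line stands in for the model being fitted; this substitution is an assumption made purely for illustration (the paper fits a FANN), and the function names are hypothetical.

```python
import random


def fit_line(xs, ys):
    """Least-squares line y = a + b x; a stand-in for fitting the model."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - b * mx, b


def residual_bootstrap(xs, ys, n_boot, rng):
    """Resample residuals, rebuild targets on the original inputs, refit."""
    a, b = fit_line(xs, ys)
    residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
    fits = []
    for _ in range(n_boot):
        res_star = rng.choices(residuals, k=len(residuals))
        ys_star = [a + b * x + e for x, e in zip(xs, res_star)]
        fits.append(fit_line(xs, ys_star))
    return fits


def pairwise_bootstrap(xs, ys, n_boot, rng):
    """Resample (x, y) pairs jointly and refit on each resample."""
    pairs = list(zip(xs, ys))
    fits = []
    while len(fits) < n_boot:
        sample = rng.choices(pairs, k=len(pairs))
        sx, sy = zip(*sample)
        if len(set(sx)) > 1:  # skip degenerate resamples with a single distinct x
            fits.append(fit_line(sx, sy))
    return fits
```

The collection of refitted parameters (or any function of them, such as the prediction at a given input) then serves as the empirical bootstrap distribution.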

Unfortunately, several problems arise with FANN models: they are neither
globally nor locally identified, because the parameterization of the
network function is not unique (for example, the hidden units can be permuted),
the activation function may be symmetric, and irrelevant
hidden units may be present. However, we can always make inferences about a distinguishable function of
the weight vector, such as the network output.

4 Robust Bootstrap Procedure in Neural Networks 

In this section we construct a robust bootstrap plan for neural networks based
on a robust resampling algorithm and robust estimators for the models. Robust
estimators are designed to produce estimates that are immune to serious
distortions up to a certain number of outliers. However, this approach remains
insufficient because the resampling procedure can break even an estimator with
a high breakdown point.

To address the potentially harmful replication of outliers in the bootstrap
samples, we work directly on the estimated distribution from which the observations
are drawn. The authors of [2] introduce a perturbation of the resampling probabilities,
ascribing more importance to some sample values than to others and using the influence
function (see [5]) to compute those selection probabilities. This procedure leads
to resampling less frequently those observations that most affect the initial
model, while assigning higher probabilities to the observations forming the main
structure.



4.1 Robust Learning Algorithm 

First, we deal with the problem of a robust model that is resistant to outlying
observations. Earlier works have shown that FANN models are affected
by the presence of outlying observations, in that the learning process
and the prediction performance deteriorate (see [1]).

Let the data set $\chi = \{x_k, y_k\}_{k=1}^{n}$ consist of an independent and identically
distributed (i.i.d.) sample of size n coming from the probability distribution
$F(x,y)$. A nonlinear function $y = \varphi(x) + \varepsilon$ is approximated from the
data by a FANN model $y = g_\lambda(x,w^*) + \varepsilon$. An M-estimator $\hat{w}_n^M$ is defined by
$\hat{w}_n^M = \arg\min\{RL_n(w) : w \in W\}$, where $RL_n(w)$ is a robust functional cost
given by the following equation.

$$RL_n(w) = \frac{1}{n}\sum_{k=1}^{n} \rho\left(\frac{y_k - g_\lambda(x_k,w)}{\sigma_\varepsilon}\right) \qquad (2)$$

where $\rho$ is the robust function that introduces a bound on the influence due
to the presence of outliers in the data. Assuming that $\rho$ is differentiable, with
derivative $\psi(r,w) = \partial\rho(r,w)/\partial r$, an M-estimator can be defined
implicitly by the solution of

$$\sum_{k=1}^{n} \psi\left(\frac{y_k - g_\lambda(x_k,w)}{\sigma_\varepsilon}\right) Dg_\lambda(x_k,w) = 0$$

where $\psi : \mathbb{R} \times W \to \mathbb{R}$, $r_k = y_k - g_\lambda(x_k, \hat{w}_n^M)$ is the residual error and

$$Dg_\lambda(x,w) = \left(\frac{\partial}{\partial w_1} g_\lambda(x,w), \ldots, \frac{\partial}{\partial w_d} g_\lambda(x,w)\right)^T \qquad (3)$$

is the gradient of the FANN. We will denote $Dg_\lambda = Dg_\lambda(x,w)$ for short.



4.2 Robust Resampling

When we resample the original set of observations, at least in the non-parametric
bootstrap case, we assign equal selection probabilities to all observations, but the
bootstrap samples can be harmfully altered by outliers. The bad behaviour of
the bootstrap when there are outliers in the original sample has been reported
in several papers [10], [2], [11]. We adapt the proposal for robustifying the
bootstrap resampling algorithm presented in [2].

To obtain the selection probabilities of the observations, we measure the influence
of each data point on the parameters. The influence function (IF) is a local measure
introduced by Hampel [5] that describes the effect of an infinitesimal contamination
on the estimate at the point (x, y). The IF of the M-estimator applied to
the FANN model, calculated at the distribution function $F(x,y)$, is given by

$$IF(x,r;w^*,F) = -\psi(r,w^*)\, M^{-1} Dg_\lambda(x,w^*)^T \qquad (4)$$

where $r = r(x,y) = y - g_\lambda(x,w)$ is the residual, $Dg_\lambda(x,w^*)$ is given by equation (3)
and M is given by $M = \int H(x,r,w^*)\,dF(x,y) = E_F[H]$, where H
is the Hessian of $\rho(\cdot)$ with respect to the parameters w, i.e.,
$H(x,r,w) = \psi'(r,w)\,Dg_\lambda Dg_\lambda^T - \psi(r,w)\,D^2 g_\lambda$, and
$D^2 g_\lambda = \left[\partial^2 g_\lambda(x,w)/\partial w_i \partial w_j\right]$ is the Hessian matrix
of the FANN, of size d × d. In practice, M is not observable and must be
estimated. White [12] demonstrated that a consistent estimator of M is

$$\hat{M}_n = \frac{1}{n}\sum_{k=1}^{n} H(x_k, r_k, \hat{w}_n^M) \qquad (5)$$

where $\hat{w}_n^M$ are the parameters obtained from the data by the minimization of
the risk function (2). With this result, we can estimate the influence at the point
$(x^*, y^*)$ by $\widehat{IF}(x^*, r^*; \hat{w}_n^M) = -\psi(r^*, \hat{w}_n^M)\, \hat{M}_n^{-1} Dg_\lambda(x^*, \hat{w}_n^M)^T$.

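Instead, we use the standardized influence function (SIF) as a measure of the
impact (or the distance) of the observation on the model:

$$SIF(r,x,w^*,F) = \sqrt{IF(r,x,w^*,F)^T\, V(w^*)^{-1}\, IF(r,x,w^*,F)} \qquad (6)$$

where the variance of the estimator is given by

$$V(w^*) = \int IF(x,y;w^*,F)\, IF(x,y;w^*,F)^T\, dF(x,y) = M(\psi,F)^{-1}\, Q(\psi,F)\, M(\psi,F)^{-T} \qquad (7)$$

and

$$Q(\psi,F) := \int \psi(r,w^*)^2\, Dg_\lambda(x,w^*)\, Dg_\lambda(x,w^*)^T\, dF(x,y) = E_F\left[\psi(r,w^*)^2\, Dg_\lambda(x,w^*)\, Dg_\lambda(x,w^*)^T\right]$$

A consistent estimator of Q is given in [12] by

$$\hat{Q}_n = \frac{1}{n}\sum_{k=1}^{n} \psi(r_k, \hat{w}_n^M)^2\, Dg_\lambda(x_k, \hat{w}_n^M)\, Dg_\lambda(x_k, \hat{w}_n^M)^T$$

With this result, we can estimate the variance of the estimator by
$\hat{V}(\hat{w}_n^M) = \hat{M}_n^{-1}\, \hat{Q}_n\, \hat{M}_n^{-T}$,
where $r^* = y^* - g_\lambda(x^*, \hat{w}_n^M)$ is the residual at the point $(x^*, y^*)$.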

Now we introduce our proposal of the robust version of the resampling proce- 
dure that will be the basis for the Robust FANN Algorithm that we will propose. 



Robust Resampling Algorithm

1. Compute the estimated SIF, $SIF_k = SIF(r_k, x_k, \hat{w}_n^M, F)$, at each data point
k = 1, 2, ..., n.

2. Compute the resampling distribution $P = (p_1, p_2, \ldots, p_n)$, where
$p_k = w_k / \sum_{j=1}^{n} w_j$ and

$$w_k = I_{[0,c]}(SIF_k) + g(c, SIF_k) \times I_{(c,\infty)}(SIF_k), \qquad k = 1, \ldots, n \qquad (8)$$

where $I_{[a,b]}$ is the indicator function of the interval [a, b], c > 0 is the tuning
constant, and g is the attenuating function activated at c, which is non-negative and satisfies
$\lim_{t \to \infty} t^2 g(c,t) = 0$ for c fixed and $\partial g(c,t)/\partial t\,|_{t=c} = 0$. The first condition is evident
because the function $g(\cdot)$ only attenuates the probability of resampling
an observation in proportion to its influence; the second guarantees that the
outliers do not introduce a significant bias. The last is a smoothness condition
that preserves the efficiency of the procedure when there are few outliers.

3. Apply a bootstrapping plan starting with this modified empirical distribution
$P = (p_1, p_2, \ldots, p_n)$.

To choose the $g(\cdot)$ function, a first choice can be (see Pires et al. [2]) a
member of the family

$$\eta(x; c, s, \tau) = \begin{cases} \left(1 + \dfrac{(x-c)^2}{\tau s^2}\right)^{-\frac{\tau+1}{2}}, & 1 < \tau < \infty \\[2ex] \exp\left(-\dfrac{(x-c)^2}{2 s^2}\right), & \tau = \infty \end{cases} \qquad (9)$$

where c is the location parameter (equal to the tuning constant c of expression
(8)), s is the scale parameter and $\tau$ is a shape parameter; a reasonable choice
could be $\tau = 2$.
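
The weighting step of the robust resampling algorithm can be sketched as follows. The sketch assumes the attenuating family has the t-distribution kernel form $(1 + (x-c)^2/(\tau s^2))^{-(\tau+1)/2}$ for finite $\tau$ and the Gaussian kernel $\exp(-(x-c)^2/(2s^2))$ for $\tau = \infty$; it also assumes the SIF values have already been computed. The function names and the numeric values of c and s in the usage note are hypothetical.

```python
import math
import random


def eta(x, c, s, tau=math.inf):
    """Attenuating function: equals 1 at x = c and decays as x moves away from c."""
    z2 = (x - c) ** 2 / s ** 2
    if math.isinf(tau):
        return math.exp(-z2 / 2.0)
    return (1.0 + z2 / tau) ** (-(tau + 1.0) / 2.0)


def resampling_probs(sif, c, s, tau=2.0):
    """Weights: 1 for SIF_k <= c, attenuated by eta for SIF_k > c; then normalized."""
    w = [1.0 if v <= c else eta(v, c, s, tau) for v in sif]
    total = sum(w)
    return [wk / total for wk in w]


def draw_indices(probs, rng):
    """Draw a bootstrap sample of indices from the modified empirical distribution."""
    n = len(probs)
    return rng.choices(list(range(n)), weights=probs, k=n)
```

As a usage example, `resampling_probs([0.5, 1.0, 1.5, 10.0], c=2.0, s=1.0)` gives the first three observations equal probability and the fourth, whose SIF is far beyond c, a much smaller one, so it is rarely replicated in the bootstrap samples.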



4.3 Robust Bootstrap FANN Algorithm (RB-FANN) 

In this section we introduce our robust bootstrap algorithm for feedforward neural
networks (RB-FANN), which applies robust resampling and a robust learning
algorithm to each of the several neural networks created.

Robust Bootstrap Algorithm for FANN

1. Choose a FANN architecture by fixing the number of hidden neurons $\lambda$,
and denote the model by $g_\lambda(x, w)$.

2. Train the neural network with the robust learning algorithm presented in
section 4.1 on the whole training dataset $\{x_k, y_k\}_{k=1}^{n}$, obtaining the robust
model $g_\lambda(x, \hat{w}_n^M)$.

3. Calculate the residuals $\hat{\varepsilon}_k = y_k - g_\lambda(x_k, \hat{w}_n^M)$ for the whole training set
$\{x_k, y_k\}_{k=1}^{n}$.

4. Bootstrap loop (b = 1, ..., B).

a) Apply the robust bootstrap resampling plan presented in section 4.2 over
the centered residuals $(\hat{\varepsilon}_1, \ldots, \hat{\varepsilon}_n)$, generating independently the bootstrap
errors $(\varepsilon_1^*, \ldots, \varepsilon_n^*)$.

b) Apply a bootstrap resampling plan over the input vectors $(x_1, \ldots, x_n)$,
generating independently the bootstrap input vectors $(x_1^*, \ldots, x_n^*)$.

c) Generate the b-th training data set $\{x_k^*, y_k^*\}$, where $y_k^* = g_\lambda(x_k^*, \hat{w}_n^M) + \varepsilon_k^*$.

d) Train the bootstrap FANN with a robust learning algorithm to obtain
the weight vector $\hat{w}^{*b}$ and the model $g_\lambda(x, \hat{w}^{*b})$.

e) Let b = b + 1; if there are sufficient (B) bootstrap replications, exit the loop and go
to 5; otherwise go to 4(a).

5. For each bootstrap weight vector $\hat{w}^{*b}$, which represents a trained network,
we evaluate the function of interest $f^{*b} = f(\hat{w}^{*b})$. For example, if we are
studying the response of the network to a given input x, we evaluate the
output of the b-th trained network for this input x, i.e., $g_\lambda^{*b}(x) = g_\lambda(x, \hat{w}^{*b})$.

6. From the set of bootstrap replications $\{f^{*1}, \ldots, f^{*B}\}$ we can make our inferences.
5 Simulation Results 

In this section we show two experimental results where we compare the classical
bootstrap for FANN [4] with our robust version. We make inferences based on
the bootstrap results; in particular, we are interested in the mean and in the α%
confidence interval extracted from the empirical distribution of the
bootstrap predictions. We focus on the prediction of a synthetic
and a real data set.

In order to carry out the simulations we need to choose the robust estimator for
the learning process of the FANN. In this section we select the Huber function,
which is given by $\psi_H(r,c) = \mathrm{sgn}(r)\,\min\{|r|, c\}$.
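
For reference, the Huber function can be written directly in code. The sketch also includes the corresponding loss $\rho_H$, whose derivative is $\psi_H$; the loss is the standard Huber form and is added here for illustration, and the cutoff used in the usage note is only a conventional value, not one taken from this paper.

```python
def huber_psi(r, c):
    """Huber psi: sgn(r) * min(|r|, c), i.e. the residual clipped to [-c, c]."""
    return max(-c, min(c, r))


def huber_rho(r, c):
    """Huber loss: quadratic for |r| <= c, linear beyond; its derivative is huber_psi."""
    if abs(r) <= c:
        return 0.5 * r * r
    return c * abs(r) - 0.5 * c * c
```

With a cutoff such as c = 1.345, a residual of 0.3 passes through unchanged while a residual of -5 contributes only -1.345, which is how the influence of outliers on the estimating equation is bounded.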

5.1 Example #1: Computer Generated Data 

In this section the procedure is applied to computer-generated data with a
one-dimensional input and output space as in [8], where the true underlying function
$\varphi(x)$, $x \in \mathbb{R}$, is known and given by:

$$\varphi(x) = \sin(w_a x)\,\sin(w_b x), \qquad w_a = 2.5,\ w_b = 1.5,\ x \in [0, \pi] \qquad (10)$$

Fig. 1. Bootstrap inference. Left: several bootstrap replications approximating the regression
function. Right: prediction and confidence interval for the regression function by bootstrap
replications.



The target values are generated by adding noise $a(x) \sim N(0, \sigma^2(x))$, where
$\sigma^2(x) = 0.01 + 0.25 \times [1 - \sin(w_a x)]^2$, and by adding additive outliers. The
observational process $z$ is obtained by $z = f(x) + a + uv$, where $v$ is a zero-one
process with $P[v \neq 0] = \beta$, $u$ has distribution $N(0, \sigma_u^2)$ and $0 < \beta \ll 1$.
We generated 200 random samples in $[0, \pi]$ to train the FANN and we tested the
model with 315 equally spaced points from 0 to $\pi$, separated by intervals of 0.01.
The parameters chosen are $\sigma_u^2 = 1$ and $\beta$ = 0%, 5%, 10%, 20%. The right side
of Figure 1 shows the regression function and the training data.

A single-input-single-output (SISO) FANN model with one hidden layer with 
ten hidden neurons and logistic activation function for the hidden layer and linear 
activation function for the output neuron was implemented to model the data. 
Two bootstrap algorithms were used, the classical (B) and our robust version 
(RB) with the Huber estimator. To obtain the parameters of the several networks 
the backpropagation with momentum algorithm was used (see [12]). 

The results of both procedures are summarized in Table 1. Performance on the
training and test sets was measured with the mean square error; for the quality
of the confidence interval (CI), the probability coverage (PC) and the
mean length (L) of the interval on the training and test sets were computed. The
PC is the proportion of points of the real data that fall inside the interval, and
the mean length is defined as $L = \frac{1}{n}\sum_{k=1}^{n}\left(y_k^{\sup} - y_k^{\inf}\right)$, where $y_k^{\sup}$ and $y_k^{\inf}$
are the superior and inferior limits of the CI at the point $(x_k, y_k)$.
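
The two interval-quality measures can be computed as in the following sketch. The function and argument names are illustrative, and the mean length is taken as the average of the upper limit minus the lower limit over the evaluated points.

```python
def coverage_and_mean_length(targets, lower, upper):
    """Return (PC in percent, mean interval length L)."""
    n = len(targets)
    inside = sum(1 for y, lo, hi in zip(targets, lower, upper) if lo <= y <= hi)
    pc = 100.0 * inside / n
    mean_len = sum(hi - lo for lo, hi in zip(lower, upper)) / n
    return pc, mean_len
```

A good interval has a PC close to the declared confidence level together with a small L; either quantity alone is not informative.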

As can be seen in Table 1, the mean of the bootstrap estimates improves the
performance of the prediction obtained by the initial net; moreover, the robust
version (RB) performs better than the classical (B) in most of the cases. If we
look at the confidence interval, the difference between the coverage probability
(PC) and the desired confidence α is almost the same in both cases, but the
robust bootstrap CIs are shorter than those of the classical bootstrap. Figure
1 shows how the bootstrap mean prediction improves the accuracy of the
initial net, and how the CI contains the regression function when the robust
bootstrap is applied.
Table 1. Performance of the model on the computer generated data: probability
coverage (PC, α = 95% declared) and mean length (L) of the Confidence Interval results
for the Synthetic dataset with 200 Bootstrap replications, 200 training patterns and
315 testing ones with β% of additive outliers

Bootstrap  β%        net MSE   Boots. MSE  PC        L
Type       Outliers  Test set  Test set    Test set  Test set
B          0         0.0209    0.0188      81.71     0.5774
RB         0         0.0182    0.0158      90.47     0.4217
B          5         0.0098    0.0112      93.41     0.5002
RB         5         0.0139    0.0113      93.96     0.4868
B          10        0.0164    0.0131      98.73     0.6233
RB         10        0.0140    0.0095      97.14     0.4943
B          20        0.0310    0.0269      100.00    0.7849
RB         20        0.0263    0.0196      93.97     0.6580


5.2 Example #2: Real Data, Boston Housing Dataset

We now apply our method to a set of observed data, the Boston housing
dataset. This dataset contains information collected by the U.S. Census Service
concerning housing in the area of Boston, Mass. It was obtained from the
StatLib archive (http://lib.stat.cmu.edu/datasets/boston), and has been used
extensively throughout the literature to benchmark algorithms. The dataset is
small in size, with only 506 cases; it has 13 input attributes (crime, residential
land, non-retail business, river, nitric oxides concentration, etc.) and the task is
to predict the median value of owner-occupied homes. This dataset is
characterized by a high level of contaminated data.

The results are shown in Table 2. As can be appreciated, the robust bootstrap
always improves the performance of the initial net, which is not the case for the
classical bootstrap, and the robust version is always better than the classical.
If we look at the confidence interval, we obtain similar results in the PC, but
the robust bootstrap has a smaller mean length of CI.



Table 2. Performance of the model on the real data: probability coverage (PC, α = 95%
declared) and mean length (L) of the Confidence Interval results for the Boston dataset
with 200 and 20 Bootstrap replications (column B). The sizes of the training set, validation set and
testing set are 308, 200 and 200 respectively

Bootstrap       net MSE             Boots. MSE          PC                  L
Type       B    Train  Val.  Test   Train  Val.  Test   Train  Val.  Test   Train  Val.    Test
B          200  73     2242  3018   69     1326  1450   77     94    97     20.36  168.33  267.37
RB         200  173    1724  1782   194    743   601    68     90    99     22.67  69.17   99.43
B          20   73     2242  3018   76     3511  4614   67     83    88     20.56  177.73  275.93
RB         20   172    1724  1782   182    617   364    56     88    94     17.09  55.74   78.45

6 Concluding Remarks

In this paper we introduce a robust bootstrap technique for estimating confidence
and prediction intervals for FANN. The intervals incorporate a significantly
improved estimate of the underlying model uncertainty. The bootstrap distribution
might be a very poor estimator of the distribution of the regression estimates
because the proportion of outliers in a bootstrap sample can be higher than in the
original sample.

The robust bootstrap plan for FANN based on a robust resampling is easy to
implement in terms of stability and computational cost (speed of convergence)
in situations involving multidimensional problems.

The algorithm shown here generates an alteration of the original empirical
distribution that gives a robust resampling scheme for the whole bootstrap
procedure, but it leaves some unsolved problems; for example, it introduces bias into
the distribution. The performance of our algorithm can be improved by applying
other inference techniques to the bootstrap results. The results presented in
this paper can be easily extended to FANN with a higher number of layers and
output neurons. Further studies are needed in order to apply this technique to
nonlinear time series, where the data are correlated.
Abstract. Kernel Maximum Likelihood Hebbian Learning Scale Invariant
Maps is a novel technique developed to facilitate the clustering of complex data
effectively and efficiently, and it is characterised by remarkably quick
convergence. The combination of the Maximum Likelihood Hebbian Learning Scale
Invariant Map and the kernel space provides a very smooth scale invariant
quantisation which can be used as a clustering technique. The method has been
used to analyse an oceanographic problem.



1 Introduction 

Kernel Maximum Likelihood Hebbian Learning Scale Invariant Map (K-MLSIM) is 
based on a modification of a new type of topology preserving map that can be used 
for scale invariant classification [6]. Kernel models were first developed within the 
context of Support Vector Machines [16]. Support Vector Machines attempt to 
identify a small number of data points (the support vectors) which are necessary to 
solve a particular problem to the required accuracy. Kernels have been successfully 
used in the unsupervised investigation of structure in data sets [15], [11], [9]. Kernel 
methods map a data set into a Feature space using a nonlinear mapping. Then 
typically a linear operation is performed in the feature space; this is equivalent to 
performing a nonlinear operation on the original data set. The Scale Invariant Map is 
an implementation of the negative feedback network to form a topology preserving 
mapping. A kernel method is applied in this paper to an extension of the Scale 
Invariant Map (SIM) which is based on the application of the Maximum Likelihood 
Hebbian Learning (MLHL) method [4] and its possibilities are explored. The 
proposed methodology groups cases with similar structure, identifying clusters 
automatically in a data set in an unsupervised mode. 



2 Kernel Scale Invariant Map 

This section reviews the techniques used to construct the K-MLSIM method. 
2.1 Scale Invariant Map

Consider a network with N-dimensional input data and M output neurons.
Then the activation of the i-th output neuron is given by:

$act_i = \sum_{j=1}^{N} w_{ij} x_j$ (1)

Now if we invoke a competition between the output neurons, it is possible to have 
a number of different competitions between output neurons, two obvious examples 
are: 

Type A: The neuron with greatest activation wins. 

Type B: The neuron closest to the input vector wins. 

The Kohonen network typically uses the second since the first requires specific 
renormalisation to ensure that all neurons have a chance of winning a competition. 
We have shown that the scale invariant mapping produced by the new network does 
not require any additional competition limiting procedure when using the first 
criterion. In both cases, the winning neuron, the p-th, is deemed to be maximally firing
(=1) and all other output neurons are suppressed (=0). Its firing is then fed back
through the same weights to the input neurons as inhibition:

$e_j = x_j - w_{pj}, \quad \forall j$ (2)

where p is the winning neuron. Now the winning neuron excites those neurons close
to it, i.e. we have a neighbourhood function $\Lambda(p,j)$ which satisfies $\Lambda(p,j) < \Lambda(p,k)$ for all
$j,k : \|p - j\| > \|p - k\|$, where $\|\cdot\|$ is the Euclidean norm. In the simulations described in this
paper, we use a Gaussian whose radius is decreased during the course of the
simulation. Then simple Hebbian learning gives

$\Delta w_{ij} = \eta\,\Lambda(p,i)\,e_j = \eta\,\Lambda(p,i)\,(x_j - w_{pj})$ (3)

where we have used $x_j$ as the activation of the j-th input neuron and $w_{ij}$ is the weight
between this and the i-th output neuron. For the winning neuron, the network is
performing simple competitive learning, but note the direct effect the p-th output
neuron's weight has on the learning of other neurons. This algorithm introduces
competition into the same network used in [5] to perform a Principal Component 
Analysis (PCA) and in [7] to perform an Exploratory Projection Pursuit (EPP). 
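
One presentation of an input to the Scale Invariant Map, following equations (1) to (3), might look like the sketch below. It assumes a one-dimensional row of output neurons, a Gaussian neighbourhood on the neuron index, and the Type A competition (greatest activation wins); the names `sim_step`, `eta` and `radius` are ours.

```python
import math

def sim_step(W, x, eta, radius):
    """Update the weight matrix W (M rows of length N) in place; return the winner."""
    # Equation (1): activation of each output neuron.
    acts = [sum(w * xj for w, xj in zip(row, x)) for row in W]
    # Type A competition: the neuron with the greatest activation wins.
    p = max(range(len(W)), key=acts.__getitem__)
    # Equation (2): residual after feedback through the winner's weights.
    e = [xj - wpj for xj, wpj in zip(x, W[p])]
    # Equation (3): Hebbian update weighted by the neighbourhood function.
    for i, row in enumerate(W):
        h = math.exp(-((i - p) ** 2) / (2.0 * radius ** 2))
        for j in range(len(row)):
            row[j] += eta * h * e[j]
    return p
```

Every neuron moves along the winner's residual, not its own, which is the direct effect of the winner's weights on the other neurons noted in the text.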



2.2 Kernel K-Means Clustering 

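We will follow the derivation of [14], who has shown that the k-means algorithm can
be performed in kernel space. The basic idea of the set of methods known as kernel
methods is that the data set is transformed into a nonlinear feature space, $\phi: x_i \to \phi(x_i)$.
Any linear operation now performed in this feature space is equivalent to a nonlinear
operation in the original space.

The aim is to find k means, $m_\mu$, so that each point is close to one of the means.
First we note that each mean may be described as lying in the manifold spanned by
the observations, $\phi(x_i)$, i.e. $m_\mu = \sum_i w_{\mu i}\,\phi(x_i)$. Now the k-means algorithm chooses the
means, $m_\mu$, to minimise the Euclidean distance between the points and the closest
mean

$\|\phi(x) - m_\mu\|^2 = \left\|\phi(x) - \sum_i w_{\mu i}\,\phi(x_i)\right\|^2 = k(x,x) - 2\sum_i w_{\mu i}\,k(x,x_i) + \sum_{i,j} w_{\mu i} w_{\mu j}\,k(x_i,x_j)$ (4)

i.e. the distance calculation can be accomplished in kernel space by means of the K
matrix alone. Let $M_{i\mu}$ be the cluster assignment variable, i.e. $M_{i\mu} = 1$ if $\phi(x_i)$ is in the
$\mu$-th cluster and is 0 otherwise. [14] initialises the means to the first training patterns
and then each new training point, $\phi(x_{t+1})$, $t+1 > k$, is assigned to the closest mean
and its cluster assignment variable calculated using

$M_{t+1,\alpha} = \begin{cases} 1 & \text{if } \|\phi(x_{t+1}) - m_\alpha\| < \|\phi(x_{t+1}) - m_\mu\|, \ \forall \mu \neq \alpha \\ 0 & \text{otherwise} \end{cases}$ (5)

In terms of the kernel function (noting that k(x,x) is common to all calculations)
we have

$M_{t+1,\alpha} = \begin{cases} 1 & \text{if } \sum_{i,j} w_{\alpha i} w_{\alpha j}\,k(x_i,x_j) - 2\sum_i w_{\alpha i}\,k(x_{t+1},x_i) < \sum_{i,j} w_{\mu i} w_{\mu j}\,k(x_i,x_j) - 2\sum_i w_{\mu i}\,k(x_{t+1},x_i), \ \forall \mu \neq \alpha \\ 0 & \text{otherwise} \end{cases}$ (6)

We must then update the mean, $m_\alpha$, to take account of the (t+1)-th data point

$m_\alpha^{t+1} = m_\alpha^{t} + \zeta\left(\phi(x_{t+1}) - m_\alpha^{t}\right)$ (7)

where we have used the term $m_\alpha^{t+1}$ to designate the updated mean which takes into
account the new data point and

$\zeta = \frac{M_{t+1,\alpha}}{\sum_{i=1}^{t+1} M_{i,\alpha}}$ (8)

Now (7) may be written as

$\sum_i w_{\alpha i}^{t+1}\,\phi(x_i) = \sum_i w_{\alpha i}^{t}\,\phi(x_i) + \zeta\left(\phi(x_{t+1}) - \sum_i w_{\alpha i}^{t}\,\phi(x_i)\right)$ (9)

which leads to an update equation of

$w_{\alpha i}^{t+1} = \begin{cases} w_{\alpha i}^{t}(1 - \zeta) & \text{for } i \neq t+1 \\ \zeta & \text{for } i = t+1 \end{cases}$ (10)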
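
The kernel-space distance of equation (4) and the online update of equation (10) can be sketched as below. The sketch assumes a precomputed kernel matrix `K` over the training points, a list `weights` holding one coefficient vector per mean, and a list `counts` with the number of points assigned to each mean so far; all of these names are illustrative.

```python
def kernel_distance_sq(K, w, t):
    """Squared distance between phi(x_t) and the mean sum_i w[i] phi(x_i), using K only."""
    n = len(w)
    cross = sum(w[i] * K[t][i] for i in range(n))
    norm = sum(w[i] * w[j] * K[i][j] for i in range(n) for j in range(n))
    return K[t][t] - 2.0 * cross + norm

def assign_and_update(K, weights, counts, t):
    """Assign point t to its closest mean and update that mean's coefficients."""
    alpha = min(range(len(weights)),
                key=lambda m: kernel_distance_sq(K, weights[m], t))
    counts[alpha] += 1
    zeta = 1.0 / counts[alpha]          # equation (8) for the winning mean
    w = weights[alpha]
    for i in range(len(w)):             # equation (10): shrink old coefficients
        w[i] *= (1.0 - zeta)
    w[t] += zeta                        # the new point enters with weight zeta
    return alpha
```

With a linear kernel this reduces to ordinary online k-means, which gives a convenient check: the coefficients of each mean always sum to one.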



2.3 Kernel Self Organising Map 

We have previously used the above analysis to derive a Self Organising Map [10] in
kernel space. The SOM algorithm is a k-means algorithm with an attempt to
distribute the means in an organised manner, and so the first change to the above
algorithm is to update the closest neuron's weights and those of its neighbours. Thus
we find the winning neuron (the closest in feature space) as above but now instead of
(6), we use

$M_{t+1,\mu} = \Lambda(\alpha,\mu), \quad \forall \mu$ (11)

where $\alpha$ is the identifier of the closest neuron and $\Lambda(\alpha,\mu)$ is a neighbourhood function
which in the experiments reported herein was a Gaussian. Thus the winning neuron
has a value of M = 1 while the value of M for other neurons decreases monotonically
with distance (in neuron space) away from the winning neuron. For the experiments
reported in this paper, we used a one dimensional vector of output neurons numbered
1 to 20 or 30. The remainder of the algorithm is exactly as reported in the previous
section.



2.4 The Kernel Scale Invariant Map 



We will, in this Section, consider only linear kernels though the extension to other 
kernels is straightforward. Since every point in data space can be represented by a 

linear combination of the training data set “ f -' a} ^ have ^ ~ for 

all X in the data space. Similarly the weight vectors can be represented in the same 
way. Thus we can represent 







( 12 ) 



as 



j ‘ j i 

Now we wish to have a competition to find out which output will win for a
particular input, $x_t$, and so the first question to be addressed is the nature of the
competition. If we are working in data space with the above overcomplete basis, then
every member of the training set is representable as a vector which is all zeros except
for the t-th position, which is set to 1. Therefore we identify the t-th column of the kernel
matrix, and determine the output which wins the competition to represent $x_t$ as

$\alpha = \arg\max_\mu\, \mathbf{k}_t \cdot \mathbf{v}_\mu$ (14)

where we have used $\mathbf{k}_t$ as the vector from the t-th column of the kernel matrix.



We have experimented with two methods: 

1 . The Kernel method of the previous section. With the notation above, we have 

[C -M = ^ 



(15) 



with defined as for the Kernel SOM. This continues to be one-shot learning with a 
subsequent gradual decay. 

2. The neural method used with the standard SIM. Note that since we are working
in the space spanned by the data as basis vectors, the input vector has a zero
everywhere but the t-th position and we are subtracting $v_\alpha$ ($\alpha$ being the winning neuron).
Thus

$e = x - v_\alpha$ (16)

in this basis. We then apply the standard learning rule so that

$\Delta v_\mu = \eta\,\Lambda(\alpha,\mu)\,e$ (17)

Even though this method is an iterative method, as is usual in neural methods, we
find that only a few (fewer than 10, often just 2 or 3) iterations through a data set are
enough to form the mapping.



3 Maximum Likelihood Hebbian Learning 

We have previously [2, 10] considered a general cost function associated with a
negative feedback PCA network,

$J = \mathbf{1}^T E\{(x - Wy)^2\}$ (18)

If the residual after the feedback has probability density function

$p(e) = \frac{1}{Z}\exp(-|e|^p)$ (19)

then we can denote a general cost function associated with this network as

$J = -\log p(e) = |e|^p + K$ (20)

where K is a constant. Finding the minimum of J corresponds to finding the maximum
of p(e), i.e. we are maximising the probability that the residual comes from a particular
distribution. We do this by adjusting the weights. Therefore, performing gradient
descent on J we have

$\Delta W \propto -\frac{\partial J}{\partial W} = -\frac{\partial J}{\partial e}\frac{\partial e}{\partial W} \approx y\,\left(p\,|e|^{p-1}\,\mathrm{sign}(e)\right)^T$ (21)

We would expect that for leptokurtotic residuals (more kurtotic than a Gaussian
distribution), values of p < 2 would be appropriate, while for platykurtotic residuals (less
kurtotic than a Gaussian), values of p > 2 would be appropriate. We have previously
[8], [3], [13] shown that this network can perform EPP.
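
The resulting learning rule can be illustrated with the following sketch of one update of a negative feedback network. The layout is an assumption made for the example: `W` has one row per output neuron, the feedforward pass is $y_i = \sum_j W_{ij} x_j$, the residual is $e_j = x_j - \sum_i W_{ij} y_i$, and the constant factor p is absorbed into the learning rate `eta`. The sketch assumes p ≥ 1.

```python
def mlhl_step(W, x, eta, p):
    """One Maximum Likelihood Hebbian Learning update of W in place (assumes p >= 1)."""
    # Feedforward activation of each output neuron.
    y = [sum(wij * xj for wij, xj in zip(row, x)) for row in W]
    # Residual after the outputs are fed back through the same weights.
    e = [xj - sum(W[i][j] * y[i] for i in range(len(W)))
         for j, xj in enumerate(x)]
    # Delta W_ij = eta * y_i * sign(e_j) * |e_j|^(p-1)
    for i, row in enumerate(W):
        for j in range(len(row)):
            sign = (e[j] > 0) - (e[j] < 0)
            row[j] += eta * y[i] * sign * abs(e[j]) ** (p - 1)
    return y, e
```

With p = 2 the update reduces to the usual Hebbian rule `eta * y_i * e_j`; other values of p reweight large and small residuals differently.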



3.1 Application to Kernel Scale Invariant Map 

Now the SIM was originally derived by introducing competition to the negative
feedback PCA network. Therefore we introduce the MLHL concept of the last section
to the Scale Invariant Map (SIM). Consider the feedback in (16). Let us present a
particular input, $x_t = (0, \ldots, 0, 1, 0, \ldots, 0)$, which has 1 in the t-th position, to the network. Then
if neuron $\alpha$ wins the competition, it is because the weights $v_\alpha$ have a high dot product
with the elements of the t-th column of the K matrix; either $v_{\alpha t}$ is large or $v_{\alpha k}$ (for an input k
which will be grouped with t) is large. Thus the residuals after feedback will have a
bimodal distribution: either the residuals will tend towards 0 or the residuals will be
large.

This suggests maximising the likelihood that the residuals come from a
sub-Gaussian distribution; therefore, we proposed the learning rules

( 22 )

This has an interesting effect on the learning rules in that a pie slice of the data is
learned but the actual positions of the neuron centres themselves (when transformed
back into data space) lie outside the data set. This enables a very smooth scale
invariant quantisation, as shown in Figure 1. The v weight vectors are shown in
Figure 2. The data set, drawn uniformly from [-1,1]×[-1,1], is shown by the crosses.
The weights of the converged Kernel Scale Invariant Map have been joined to form
almost a circle. The corresponding vectors in data space, based on the data as basis
vectors, are shown in Figure 2; each line of the diagram represents the weights of one
output neuron in terms of the data points (the v weights in fact).



Fig. 1. The data set is shown by the red crosses. It was drawn uniformly from [-1,1]×[-1,1]. The
weights of the converged KSIM have been joined to form almost a circle.





Fig. 2. The V weights as represented in the data basis. Each line is the weight vector into an 
output neuron and is shown. 



We have found that even a very small departure from the standard K-SIM 
parameter (with p=2) gives a visible change to the representation of the data set. In 
the top left of Figure 3, we show the converged weights after 7 iterations of the K- 
SIM algorithm when p=1.8; the weights are well enclosed in the data. In the top right 
of that figure, we show the weights when p=2.1 was used; the weights are beginning
to move outside the data set. In the bottom figure, we show what happens when 
p=2.5. Now the weights are well outside the data. 




Fig.3. The left figure shows the weights after 7 iterations of the K-SIM algorithm with p=1.8 on 
the standard artificial data set. The one in the middle shows the weights with p=2.1. The right 
figure shows the weights when p=2.5. 

Consider the situation in which there are n points in the pie slice won by output $y_\alpha$.
Without loss of generality let us write $w = (a_1, a_2, \ldots, a_n, 0, 0, \ldots, 0)$, i.e. the vector w has
non-zero components corresponding to the points (in the training set) $x_1, \ldots, x_n$, while
its components corresponding to the remaining points are all zero. Then

$\Delta w = \eta\,\Lambda(\alpha,\mu)\,\mathrm{sign}(e)\,|e|^{p-1}$ (23)

Let us consider only the situation in which the points $x_1, \ldots, x_n$ are presented to the
network and so the output $y_\alpha$ is the winner; thus $\Lambda(\alpha,\alpha) = 1$. Let point $x_1$ be presented,
and so x = (1,0,0,...,0). Then

$e = (1 - a_1, -a_2, \ldots, -a_n, 0, \ldots, 0)$ (24)

When point $x_2$ is presented, x = (0,1,0,...,0) and

$e = (-a_1, 1 - a_2, \ldots, -a_n, 0, \ldots, 0)$ (25)

We will consider the effect of the update rules on this weight vector for different
values of p.

p=1. Focus now on the first element of w, the element of the weight vector linking
input $x_1$ to output $y_\alpha$; then at convergence $E(\Delta w_{\alpha 1}) = E(\mathrm{sign}(e_1)) = 0$. Clearly if $a_1 < 0$,
$\mathrm{sign}(e_1)$ is always positive. Therefore $a_1 > 0$. With $0 < a_1 < 1$, $e_1$ is positive for one
presentation and negative for the other $n-1$, so the expectation vanishes only if $n = 2$.
Thus there can only be two non-zero elements in this vector, which is impossible in general.

p=2. This is the standard K-SIM. Then at convergence $E(\Delta w_{\alpha 1}) = E(\mathrm{sign}(e_1)|e_1|) = 0$
and so

$1 - a_1 + (n-1)(-a_1) = 0$ (26)

Thus $a_1 = 1/n$. This argument applies equally to all non-zero elements of w and so
$w = \left(\frac{1}{n}, \frac{1}{n}, \ldots, \frac{1}{n}, 0, \ldots, 0\right)$, which when we translate back to the original basis means
that the centre of the KSIM is given by

$c_\alpha = \frac{1}{n}x_1 + \frac{1}{n}x_2 + \ldots + \frac{1}{n}x_n$ (27)

the mean of the data points which neuron $\alpha$ is responsible for representing.

p=3. At convergence $E(\Delta w_{\alpha 1}) = E(\mathrm{sign}(e_1)|e_1|^2) = 0$ and so, for $0 < a_1 < 1$,

$(1 - a_1)^2 - (n-1)(a_1)^2 = 0$ (28)

Solving this, we find that $a_1 = \frac{-1 \pm \sqrt{n-1}}{n-2}$. Thus for n=10, $a_1 = \frac{1}{4}$ or $-\frac{1}{2}$. In
practice, we have never seen the latter result but it seems, in principle, possible. Note
that the solution is an equally weighted sum of the data points where the weights are
greater than $\frac{1}{n}$. Thus the centre is $c_\alpha = C\,\bar{x}$, where C is a constant and $\bar{x}$ is the
mean of the data points defined in (27).

This gives the broad picture. Intermediate values of p will give solutions
somewhere between the results given above. In general, at convergence
$E(\Delta w_{\alpha 1}) = E(\mathrm{sign}(e_1)|e_1|^{p-1}) = 0$ and so

$(1 - a_1)^{p-1} - (n-1)(a_1)^{p-1} = 0$ (29)

if $a_1 > 0$. This implies

$a_1 = \frac{1}{1 + (n-1)^{1/(p-1)}}$ (30)

The effect of varying p with n=10 is shown in Figure 4. We see that for values
close to 1, $a_1$ remains close to 0, while for values approaching 2, we reach the 1/10
value. Subsequently, $a_1$ continues to rise but shows signs of levelling off (right figure).

Fig. 4. As we vary p from 1 to 2, $a_1$ climbs slowly to 0.1 (left figure) and grows more slowly
subsequently (right figure).
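
The closed form in (30) is easy to evaluate numerically. The short sketch below (the function name `a1` is ours) reproduces the special cases derived above for n = 10: the value 1/n at p = 2 and the value 1/4 at p = 3.

```python
def a1(p, n):
    """Converged weight a_1 from equation (30); requires p > 1 and n >= 2."""
    return 1.0 / (1.0 + (n - 1) ** (1.0 / (p - 1.0)))

# Tabulate a_1 for n = 10 over a few values of p.
for p in (1.1, 1.5, 2.0, 2.5, 3.0):
    print(p, a1(p, 10))
```

Values of p very close to 1 make the exponent 1/(p-1) large, so the expression should be evaluated with care there.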



4 Experiments Using Real Data Using the K-MLSIM and
Conclusions

In this section we detail the results of the K-MLSIM on the red tides data.
The K-MLSIM is specially suited to be used on the red tides data because of the
flexibility of the p-parameter. Changing this parameter affects the update of the
centres and it determines how far each centre will move. This property allows the
selection of different clustering combinations, penalising or accentuating the
representation of outliers in the clustering. In the case of the red tides data, as each
instance of a red tide would be considered an outlier, it is necessary for us to use an
algorithm that deals with these important data points in an appropriate manner. In this
section we will show that the K-MLSIM is a powerful method that can extract the
relevant clustering information.

We have shown previously that the ε-insensitive SIM [12] in data space, which is
equivalent to using a p value of 0, will place each centre in the median of the cluster.
This property was extended by [1] to use different values of p to achieve different
clustering combinations. The larger the value of p, the greater the contribution outliers
will have in the clustering.



Fig. 5. Clustering of K-SIM on red tides data using 100 centres and a value of p = 0 (Fig. 5a), a 
value of p = 0.5 (Fig 5b) and a value of p = 2 (Fig 5c). 
The effect that this parameter has on the clustering is important as it allows us to
change the clustering to be more representative of the properties of the red tides data 
as it contains small numbers of outliers which are very important for the 
classification. If we were to cluster with a contemporary algorithm, we would likely 
be left with fewer centres that identify instances of red tides than we would like. In 
contrast the K-SIM can be tailored to give us more centres identifying red tides, 
resulting in a more expressive clustering. 

Figure 5 shows the clusters produced by the K-MLSIM with different values of the 
p parameter. This parameter will penalize the effect of outliers in the data. This results 
in the centres being placed in the median of the data cloud. 

As we can see from Figure 5a, the clustering is a compact coding with at most 800
data samples being assigned to one centre. This coding is less interesting for the red
tides data as the abnormal points are those which are important, and so we wish to
find a clustering which promotes rather than penalises them. In Figure 5b we use a
value of p = 0.5, which places more emphasis on the larger changes in the weights,
which in turn means that outliers, or abnormal data, will have a greater effect on the
learning. Thus we can ensure that the data points representing red tides will be
strongly represented in the clustering, which is ideal for this problem. In Figure 5b we
can see that there is a sparser representation, as there is a greater emphasis placed
on large differences between the winner and the input data. In this figure we can see
that the dense clusters are assigned fewer centres and that less dense clusters that
contain outliers are more favoured. Assigning more centres to the less dense clusters
also allows us to get a better representation within clusters.

Figure 5c shows the clustering of the K-MLSIM on the red tides data using a p
value of 2. The K-MLSIM penalises small changes in the weights more than
with p = 0.5. As can be seen from Figure 5c, it has provided an even sparser
representation of the clustering. This is a far more suitable clustering of the red tides
data than where we used smaller values of p. By changing the value of p in the weight
update rule, the K-MLSIM can be adapted to penalise or promote outliers in its
clustering.

We have demonstrated a new technique for clustering. Of interest too is the fact 
that the method allows investigation of the nonlinear projection matrix K that readily 
reveals when a new situation behaves similarly, which may be very important in the 
identification of toxin episodes in coastal water. 
Abstract. Several research works have shown that Artificial Neural
Networks (ANNs) have an appropriate inductive bias for several
domains, since they can learn any input-output mapping, i.e., ANNs
have the universal approximation property. Although symbolic learning
algorithms have a less flexible inductive bias than ANNs, they are
needed when a good understanding of the decision process is essential,
since symbolic ML algorithms express the knowledge induced using
symbolic structures that can be interpreted and understood by humans. On
the other hand, ANNs lack the capability of explaining their decisions,
since the knowledge is encoded as real-valued weights and biases of the
network. This encoding is difficult for humans to interpret. Aiming
to take advantage of both approaches, this work proposes a method that
extracts symbolic knowledge, expressed as decision rules, from ANNs. The
proposed method combines knowledge induced by several symbolic ML
algorithms through the application of a Genetic Algorithm (GA). Our
method is experimentally analyzed in a number of application domains.
Results show that the method is able to extract symbolic knowledge
having high fidelity with trained ANNs. The proposed method is also
compared to TREPAN, another method for extracting knowledge from
ANNs, showing promising results.



1 Introduction 

Artificial Neural Networks (ANNs) have been successfully employed in several
application domains. However, the comprehensibility of the induced hypothesis
is as important as its performance in many of these applications. This is one
of the main criticisms of ANNs: the lack of capability for explaining their
decisions, since the knowledge is encoded as real-valued weights and biases. On
the other hand, the comprehensibility of the induced hypothesis is one of the main
characteristics of symbolic Machine Learning (ML) systems. This work
explores the lack of comprehensibility of the models induced by ANNs, proposing
solutions for the following problem:

Given a model produced by a learning system, in this case ANNs, and
represented in a language that is difficult for the majority
of users to understand, how to re-represent this model in a language that improves
comprehensibility so that it is easily understood by a user.

Several research works have investigated how to convert hypotheses induced by
ANNs to more human comprehensible representations, as surveyed in [1]. The
majority of the proposed methods have several limitations, such as: they can only be
applied to specific network models or training algorithms; they do not scale well with
the network size; and they are restricted to problems having exclusively
discrete-valued features.

This work proposes the use of symbolic ML systems and GAs to extract
comprehensible knowledge from ANNs. The main goal of this work is to obtain
a symbolic description of an ANN that has a high degree of fidelity with the
knowledge induced by the ANN. The proposed method can be applied to any type
of ANN. In other words, the method does not assume that the network has any
particular architecture, nor that the ANN has been trained in any special way.
Furthermore, the induction of symbolic representations is not directly affected
by the network size, and the method can be used for applications involving both
real-valued and discrete-valued features.

This paper is organized as follows: Section 2 describes the proposed
method to extract comprehensible knowledge from ANNs; in Section 3 the
method is experimentally evaluated on several application domains; and finally,
in Section 4 the main conclusions of this work are presented, as well as some
directions for future work.



2 Proposed Method 

The method proposed in this work uses symbolic ML algorithms and GAs to
extract symbolic knowledge from trained ANNs. In this method, an ANN is
trained over a data set E. After the training phase, the ANN is used as a “black
box” to classify the data set E, creating a new class attribute. The values of
the new class attribute reflect the knowledge learned by the ANN, i.e., these
values reflect the hypothesis h induced by the ANN. The data set labelled by the
trained ANN is subsequently used as input to p symbolic ML systems, resulting
in p symbolic classifiers $h_1, \ldots, h_p$. Each classifier $h_i$, $1 \le i \le p$, approximates
the hypothesis h induced by the ANN.

Unfortunately, each symbolic ML system represents the induced concept in a
different language, hindering the integration of these classifiers. Thus, it is
necessary to translate the representation of these classifiers to a common language. In
order to make such a translation, we used the work of Prati [7], which proposes a
standard syntax called VBM to represent rules. Thus, each symbolic classifier is
translated to the VBM format. After the classifiers $h_1, h_2, \ldots, h_p$ are converted
to the VBM syntax, they are integrated into a rule database. A unique natural
number is assigned to each rule in the rule database in order to identify
it.

The rules stored in the rule database are used to form the individuals of the 
GA population. Each individual is formed by a set of rules. The representation 
of each individual is a vector of natural numbers, where each number is the 
identifier of a rule in the rule database. The initial population of the GA is 
composed by vectors with random numbers, representing sets of random rules 
from the rule database. 

During the GA execution, the individuals, i.e., the rule sets, are modified
by the mutation and crossover operators. The mutation operator randomly
replaces a rule in one of the rule sets with another rule from the rule database.
The crossover operator implemented is asymmetric, i.e., given two rule sets, two
crossover points are chosen and the sub-vectors defined by the two crossover
points are exchanged. It is important to note that, because the crossover operator
is asymmetric, even though all initial individuals have the same number of rules,
the selection of the fittest individuals may lead to the survival of larger
or smaller rule sets. The GA fitness function calculates the infidelity rate between
the ANN and each individual. Thus, the objective of the GA is to minimize the
infidelity rate. The infidelity rate is the percentage of instances for which the
classification made by the ANN disagrees with the classification made by the
method used to explain the ANN.
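
The individual encoding, the two operators and the fitness measure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, mutation is applied per position with probability p_m, and "asymmetric" is read as one independent cut point per parent with the tails exchanged. The actual implementation is the one detailed in [6].

```python
import random

def random_individual(n_rules_in_db, t_i, rng):
    """An individual is a vector of rule identifiers drawn from the rule database."""
    return [rng.randrange(n_rules_in_db) for _ in range(t_i)]

def mutate(individual, n_rules_in_db, p_m, rng):
    """Assumption: each position is replaced by a random rule with probability p_m."""
    return [rng.randrange(n_rules_in_db) if rng.random() < p_m else rule_id
            for rule_id in individual]

def asymmetric_crossover(parent_a, parent_b, rng):
    """Cut each parent at its own point and exchange the tails.

    The two cut points are independent, so the children may be longer or
    shorter than their parents (assumed reading of "asymmetric").
    """
    if len(parent_a) < 2 or len(parent_b) < 2:
        return list(parent_a), list(parent_b)
    cut_a = rng.randint(1, len(parent_a) - 1)
    cut_b = rng.randint(1, len(parent_b) - 1)
    child_1 = parent_a[:cut_a] + parent_b[cut_b:]
    child_2 = parent_b[:cut_b] + parent_a[cut_a:]
    return child_1, child_2

def infidelity_rate(ann_labels, rule_set_labels):
    """Percentage of instances on which the rule set disagrees with the ANN."""
    disagreements = sum(1 for a, r in zip(ann_labels, rule_set_labels) if a != r)
    return 100.0 * disagreements / len(ann_labels)

rng = random.Random(0)
a = random_individual(n_rules_in_db=50, t_i=10, rng=rng)
b = random_individual(n_rules_in_db=50, t_i=10, rng=rng)
c1, c2 = asymmetric_crossover(a, b, rng)
print(len(c1), len(c2))  # lengths may differ, but always sum to 20
print(infidelity_rate(["yes", "no", "yes", "yes"], ["yes", "yes", "yes", "no"]))  # 50.0
```

Under this reading the total number of rules in a pair of children is conserved, while the size of each child is free to drift, which is what allows selection to favor larger or smaller rule sets.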

Another important issue is how to classify a new instance given a rule set. The
strategies employed to classify an instance given a set of rules are: SingleRule,
which uses the classification given by the rule with the highest prediction power;
and MultipleRules, which uses all fired rules to classify an instance. After the
execution of the GA, one winning individual is obtained. Its set of rules is usually
composed of rules obtained from classifiers induced by different symbolic ML
systems. This set of rules is then used to explain the behavior of the ANN. A
post-processing phase may be applied to the winner. The objective of this
post-processing is to remove from the winner those rules that are not fired for any
instance in the training and validation sets. Additional details about the GA
implementation can be found in [6].
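
The two classification strategies and the post-processing step can be illustrated as follows. The rule layout (a list of conditions, a predicted class and a single quality score standing in for "prediction power") is hypothetical, and combining fired rules by summing their quality per class is an assumption made for this sketch. The fall-back to a default class follows the default rule described in the conclusions.

```python
from collections import defaultdict

# Hypothetical rule layout: (conditions, predicted_class, quality), where
# conditions is a list of (feature, operator, value) triples.
OPERATORS = {
    "==": lambda a, b: a == b,
    "<=": lambda a, b: a <= b,
    ">": lambda a, b: a > b,
}

def fires(rule, instance):
    """A rule fires when all of its conditions hold for the instance."""
    conditions, _, _ = rule
    return all(OPERATORS[op](instance[feature], value)
               for feature, op, value in conditions)

def classify_single_rule(rules, instance, default_class):
    """SingleRule: use the fired rule with the highest quality."""
    fired = [rule for rule in rules if fires(rule, instance)]
    if not fired:
        return default_class
    return max(fired, key=lambda rule: rule[2])[1]

def classify_multiple_rules(rules, instance, default_class):
    """MultipleRules: let every fired rule vote (assumed: quality-weighted sum)."""
    votes = defaultdict(float)
    for rule in rules:
        if fires(rule, instance):
            votes[rule[1]] += rule[2]
    if not votes:
        return default_class
    return max(votes, key=votes.get)

def remove_unfired_rules(rules, instances):
    """Post-processing: drop rules that fire for none of the given instances."""
    return [rule for rule in rules if any(fires(rule, x) for x in instances)]

rules = [
    ([("a", "<=", 5)], "pos", 0.9),
    ([("b", "==", "x")], "neg", 0.6),
    ([("a", ">", 2)], "neg", 0.5),
    ([("a", ">", 100)], "pos", 0.99),
]
instance = {"a": 3, "b": "x"}
print(classify_single_rule(rules, instance, "neg"))     # pos
print(classify_multiple_rules(rules, instance, "neg"))  # neg
print(len(remove_unfired_rules(rules, [instance])))     # 3
```

The example instance shows how the two strategies can disagree: the single best fired rule predicts "pos", while the two weaker rules together outweigh it.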

3 Experimental Evaluation 

Several experiments were carried out in order to evaluate the proposed method.
Experiments were conducted using six data sets collected from the UCI
repository [2]. These data sets are related to classification problems in different
application domains. Table 1 shows the corresponding data set names and summarizes
some of their characteristics. The characteristics are: #Instances - the total
number of instances; #Features - the total number of features, as well as the
number of continuous and nominal features; Class and Class % - the class
values and the distribution of these values; Majority Error - the majority error; and
Missing Values - whether the data set has missing values.

Table 1. Data sets summary description.

Data Set  #Instances  #Features     Class       Class %  Majority  Missing
                      (cont.,nom.)                       Error     Values
breast    699         9 (9,0)       benign      65.52%   34.48%    yes
                                    malignant   34.48%
crx       690         15 (6,9)      -           55.50%   44.50%    yes
                                    +           44.50%
heart     303         13 (6,7)      absence     54.13%   45.87%    yes
                                    presence    45.87%
pima      768         8 (8,0)       0           65.02%   34.98%    no
                                    1           34.98%
sonar     208         60 (60,0)     M           53.37%   46.63%    no
                                    R           46.63%
votes     435         16 (0,16)     republican  54.80%   45.20%    no
                                    democrat    45.20%

Table 2. Error rate obtained by the ANNs - mean and standard deviation.

breast     2.98 (0.46)
crx       13.47 (1.01)
heart     17.78 (2.12)
pima      22.92 (1.15)
sonar     20.14 (3.21)
votes      3.69 (0.86)


The experiments were divided into three phases. In the following sections 
these phases are described in the same order that they were conducted and the 
results obtained in each phase are also presented. 

3.1 Phase 1 

The objective of Phase 1 is to train the ANNs, whose knowledge will be extracted
in the next phases. Phase 1 was performed as follows:

1. Each of the six data sets was divided using the 10-fold stratified
cross-validation resampling method [10]. Each training set was further divided
into two subsets: a training set (with 90% of the instances) and a validation set
(with 10% of the instances).

2. All the networks were trained using the Backpropagation with Momentum
algorithm [9], and the validation set was used to decide when the training should
stop. After several experiments, the architectures chosen were: breast (9-3-1),
crx (43-7-1), heart (22-1-1), pima (8-2-1), sonar (60-12-1) and votes
(48-1).

3. The error rate for each data set was measured on the test set. Table 2 shows
the mean error rates obtained over the 10 folds, with their respective standard
deviations in parentheses.
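
The resampling scheme of step 1 can be sketched as below. The function names are hypothetical, and the way the 10% validation subset is drawn (a plain random sample of the nine training folds) is an assumption, since the text only gives the proportions.

```python
import random
from collections import defaultdict

def stratified_folds(labels, k, rng):
    """Deal the indices of each class round-robin into k folds."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds

def cross_validation_splits(labels, k=10, validation_fraction=0.1, seed=0):
    """Yield (training, validation, test) index lists, one triple per fold."""
    rng = random.Random(seed)
    folds = stratified_folds(labels, k, rng)
    for i, test in enumerate(folds):
        remaining = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        rng.shuffle(remaining)
        # Assumption: the validation subset is a simple random 10% sample.
        n_validation = max(1, round(validation_fraction * len(remaining)))
        yield remaining[n_validation:], remaining[:n_validation], test

labels = ["a"] * 60 + ["b"] * 40
splits = list(cross_validation_splits(labels))
training, validation, test = splits[0]
print(len(training), len(validation), len(test))  # 81 9 10
```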



3.2 Phase 2 

In this phase, the ANNs trained in Phase 1 are used to label the data sets,
i.e., they are used to create a new class attribute. The results obtained by the
proposed method are compared with those of another method for extracting
knowledge from ANNs, TREPAN [4]. Given a trained ANN and the training set used
for its training, the TREPAN method builds a decision tree to explain the ANN
behavior. TREPAN uses the trained ANN to label the training set and builds a
decision tree based on this data set. TREPAN can also generate artificial data
automatically, using the trained ANN to label the new instances. Unlike most
decision tree algorithms, which separate the instances of different classes by using
a single attribute to partition the input space, TREPAN uses m-of-n expressions
for its splits. An m-of-n expression is a boolean expression that is satisfied when
at least m (an integer threshold) of its n conditions (boolean conditions) are
satisfied.

Table 3. Parameter values employed in the experiments with TREPAN.

            MinSample  SplitTest
simple0     0          simple
simple1000  1000       simple
mofn0       0          mofn
mofn1000    1000       mofn

Table 4. GA's parameter values employed in the experiments.

parameters  values
ng          20
ni          20
ti          10
Pc          0.25
Pm          0.01

Phase 2 was performed as follows:

1. TREPAN was executed with different values assigned to its main parameters,
including its default parameters, as shown in Table 3. The MinSample
parameter of TREPAN specifies the minimum number of instances (i.e., training
instances plus artificial instances) to be considered before selecting each
split. The default value is 1000. When this parameter is set to 0, no artificial
instance is generated by TREPAN. The SplitTest parameter defines whether the
splits of the internal nodes are m-of-n expressions (mofn) or simple splits
(simple). The option mofn1000 is the default option used by the TREPAN
method, that is, m-of-n expressions are employed and 1000 instances must
be available in a node for this node to be either expanded or converted into
a leaf node.

2. The infidelity rate between TREPAN and the ANN was measured on the test 
set. 

3. The syntactic comprehensibility of the knowledge extracted by TREPAN was 
measured. In this work, the syntactic comprehensibility was measured con- 
sidering the number of induced rules and the average number of conditions 
per induced rule. 

4. The training, validation and test sets used to train the ANNs in Phase 1 were
labelled by the corresponding ANNs.

5. The symbolic ML systems C4.5, C4.5rules [8] and CN2 [3] were chosen to 
be used in the experiments. These systems are responsible for generating 
the rules for the proposed method, i.e., the rules that will be integrated 
by the GA. The C4.5, C4.5rules and CN2 were executed with their default 
parameters. 

6. The infidelity rate between the symbolic ML systems (C4.5, C4.5rules, and 
CN2) and the ANN was measured on the test set, as well as the syntactic 
comprehensibility of the models induced by the symbolic inducers. 
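
The m-of-n splits used by TREPAN, defined before the Phase 2 steps, can be made concrete with a small sketch. The function name and the representation of conditions as Python predicates are choices made for this illustration only and are not taken from TREPAN itself.

```python
def m_of_n(m, conditions, instance):
    """True when at least m of the n boolean conditions hold for the instance."""
    satisfied = sum(1 for condition in conditions if condition(instance))
    return satisfied >= m

# Three hypothetical conditions over a dictionary-like instance.
conditions = [
    lambda x: x["a"] > 1,
    lambda x: x["b"] == "y",
    lambda x: x["c"] < 0,
]
instance = {"a": 2, "b": "y", "c": 5}
print(m_of_n(2, conditions, instance))  # True: two of the three conditions hold
print(m_of_n(3, conditions, instance))  # False
```

With m = 1 the expression behaves as a disjunction of its conditions and with m = n as a conjunction, so a single m-of-n split can stand in for a subtree of simple splits.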

3.3 Phase 3

In this phase, the classifiers produced in Phase 2 by the symbolic ML systems
C4.5, CN2 and C4.5rules on the training set labelled by the ANNs are used to form
the GA individuals. Phase 3 was carried out as follows:

1. The classifiers induced by the symbolic ML systems C4.5, C4.5rules and CN2
were converted to the VBM syntax.

2. For each of the six data sets employed in this experiment, a GA was executed
several times, varying its parameters as well as the strategies used to
classify an instance given a set of rules. Table 4 shows the values used for
the parameters: ng (number of generations); ni (number of individuals); ti
(initial size of an individual, that is, the number of rules of each individual); Pc
(probability of crossover) and Pm (probability of mutation), used with the
approaches SingleRule and MultipleRules in the GA. The values ng = 40,
ni = 15 and Pc = 0.4 were also used with the approach SingleRule.

3. Finally, the infidelity rate and the syntactic comprehensibility of the
winning individual were measured.
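
The two syntactic comprehensibility measures used throughout the experiments (number of rules and average number of conditions per rule) are simple to compute. The sketch below assumes, for illustration only, that each rule is stored as a pair made of a list of conditions and a predicted class.

```python
def syntactic_comprehensibility(rules):
    """Return (number of rules, mean number of conditions per rule)."""
    n_rules = len(rules)
    mean_conditions = sum(len(conditions) for conditions, _ in rules) / n_rules
    return n_rules, mean_conditions

# Hypothetical rule set with 2, 3 and 1 conditions.
winner = [
    (["a <= 5", "b == x"], "pos"),
    (["a > 5", "c > 0", "d == y"], "neg"),
    (["e <= 1"], "pos"),
]
print(syntactic_comprehensibility(winner))  # (3, 2.0)
```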

Table 5 shows the infidelity rate (mean and standard deviation) obtained
by the symbolic ML systems, TREPAN and the GA. Table 6 shows the number of
induced rules and Table 7 shows the average number of conditions per rule.
The results of infidelity rate, number of induced rules and mean number of
conditions per induced rule, for the symbolic ML systems and TREPAN, were
obtained in Phase 2. SingleRule means that the GA was executed with the
SingleRule strategy and SingleRulePP refers to the same strategy followed
by the post-processing; MultipleRules means the MultipleRules strategy and
MultipleRulesPP refers to the same strategy followed by the post-processing.
The GA was executed with the parameters ng = 20, ni = 20, ti = 10, Pc = 0.25
and Pm = 0.01 for SingleRule, SingleRulePP, MultipleRules and
MultipleRulesPP. In *SingleRule and *SingleRulePP the parameters are ng = 40,
ni = 15, ti = 10, Pc = 0.4 and Pm = 0.01.

In what follows, the best results are shown in boldface. The 10-fold
cross-validated paired t test [5] was used to compare the infidelity rate and
comprehensibility. According to the 10-fold cross-validated paired t test, the
difference between two algorithms is statistically significant with 95% confidence
if the result of this test is greater than 2.262 in absolute value. The † symbol
indicates that a difference is significant with 95% confidence when the method in
boldface is compared to the other methods.
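
The threshold 2.262 is the two-sided 95% critical value of Student's t distribution with 9 degrees of freedom, which is what 10 folds provide. As an illustration, and assuming the usual form of the k-fold cross-validated paired t statistic (mean fold-wise difference times the square root of k, divided by the sample standard deviation of the differences), the test can be sketched as follows. The function name and the example values are hypothetical.

```python
import math

def cv_paired_t(scores_a, scores_b):
    """k-fold cross-validated paired t statistic for two methods.

    scores_a[i] and scores_b[i] are the values measured for the two methods
    on fold i (for example, their infidelity rates).
    """
    differences = [a - b for a, b in zip(scores_a, scores_b)]
    k = len(differences)
    mean = sum(differences) / k
    variance = sum((d - mean) ** 2 for d in differences) / (k - 1)
    if variance == 0.0:
        raise ValueError("all fold differences are identical; t is undefined")
    return mean * math.sqrt(k) / math.sqrt(variance)

# Hypothetical per-fold infidelity rates for two methods over 10 folds.
method_a = [3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
method_b = [1, 1, 4, 5, 6, 6, 8, 9, 9, 11]
t = cv_paired_t(method_a, method_b)
print(round(t, 4), abs(t) > 2.262)  # 6.7082 True
```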

Table 8 shows the methods in ascending order of infidelity rate. Table 9
presents the methods in ascending order of mean number of induced rules, and
Table 10 shows the methods in ascending order of mean number of conditions
per induced rule. The number to the right of each method's name indicates
how many significant results, with 95% confidence, the method obtained when
compared with the remaining methods. It can be observed that:

- For the breast data set, even though the method SingleRule obtained the
best result for the infidelity rate, with 3 significant results when compared
with the remaining methods, this method did not obtain good results for the
number of induced rules and for the mean number of conditions per induced
rule. The method SingleRulePP, despite obtaining only 1 significant result
for the infidelity rate, obtained 8 significant results for the number
of induced rules and 7 significant results for the mean number of conditions
per induced rule.

Table 5. Infidelity rate (mean and standard deviation).

                   breast          crx             heart           pima            sonar           votes
C4.5             †  2.93 (0.53)     2.91 (0.81)  † 14.07 (1.54)     8.85 (1.07)    22.98 (3.21)     3.21 (0.97)
C4.5rules           2.48 (0.53)     3.06 (0.72)    10.00 (1.11)     8.46 (1.01)    23.43 (3.25)     2.98 (1.07)
CN2                 1.90 (0.62)  †  5.82 (0.75)  † 13.70 (1.99)     9.12 (0.94)    22.57 (3.35)     2.75 (0.95)
simple0          †  3.37 (0.44)     3.36 (0.71)  † 14.07 (1.81)  † 10.81 (0.65)    24.90 (3.06)     3.67 (1.03)
simple1000       †  4.39 (0.95)     5.37 (1.85)    12.22 (2.53)     8.86 (1.15)    22.14 (2.91)     2.30 (0.96)
mofn0               2.78 (0.63)     3.83 (0.84)    12.22 (1.75)    10.42 (1.31)    25.57 (3.15)     2.99 (0.97)
mofn1000            3.07 (0.80)     4.75 (1.49)     7.04 (1.40)     7.95 (1.01)  † 27.36 (2.73)     2.76 (0.75)
GA: ng = 20, ni = 20, ti = 10, Pc = 0.25, Pm = 0.01
SingleRule          1.61 (0.55)     3.52 (0.65)    12.22 (1.57)     8.85 (1.00)    22.05 (3.24)     2.76 (0.66)
SingleRulePP        2.20 (0.70)     4.44 (0.66)    12.59 (1.58)    10.42 (1.06)  † 24.93 (3.36)     3.91 (1.07)
MultipleRules       2.04 (0.58)     3.22 (0.67)    13.33 (1.48)     8.20 (0.97)  † 25.00 (2.11)     2.99 (0.83)
MultipleRulesPP     2.04 (0.58)     3.22 (0.67)    13.33 (1.48)     8.20 (0.97)  † 25.00 (2.11)     2.99 (0.83)
GA: ng = 40, ni = 15, ti = 10, Pc = 0.4, Pm = 0.01
*SingleRule         1.75 (0.64)     3.68 (0.66)     8.89 (0.99)     8.60 (1.10)    23.05 (2.74)     2.98 (1.32)
*SingleRulePP       2.63 (0.71)     4.90 (0.60)    11.48 (2.24)     8.34 (1.21)    21.07 (2.24)     4.15 (1.37)

Table 6. Number of induced rules (mean and standard deviation).

                   breast          crx             heart           pima            sonar           votes
C4.5             †  8.90 (0.91)  † 14.10 (3.44)  † 17.40 (1.33)  † 25.00 (0.92)  † 13.50 (0.34)  †  9.40 (1.07)
C4.5rules        †  8.00 (0.54)  †  8.70 (1.58)  † 11.40 (0.8)   † 18.30 (1.16)  †  8.10 (0.38)     5.80 (0.44)
CN2              † 12.10 (0.55)  † 13.50 (1.42)  † 16.00 (0.75)  † 24.40 (0.97)  † 25.20 (0.53)  † 13.30 (0.67)
simple0          †  9.40 (0.48)  † 31.50 (4.81)  † 18.10 (0.99)  † 26.10 (1.48)  † 12.40 (0.60)  † 13.60 (0.90)
simple1000       † 10.60 (0.73)  † 31.00 (8.03)  † 20.00 (1.92)  † 29.30 (2.01)  †  9.50 (1.10)  † 11.00 (1.61)
mofn0            †  7.40 (0.78)  † 11.10 (1.76)  †  9.20 (0.83)  † 23.30 (1.38)  †  6.90 (0.53)  †  7.10 (0.53)
mofn1000            2.40 (0.31)  †  9.90 (1.94)     5.30 (0.83)  † 26.00 (1.55)     4.10 (0.67)     6.50 (0.72)
GA: ng = 20, ni = 20, ti = 10, Pc = 0.25, Pm = 0.01
SingleRule       † 14.60 (1.12)  † 16.20 (3.15)  † 21.50 (2.26)  † 18.40 (1.97)  † 20.80 (3.86)  † 15.00 (1.62)
SingleRulePP     †  7.00 (0.47)     4.50 (0.96)     7.90 (0.74)     9.60 (1.37)  †  7.60 (1.24)     5.30 (0.67)
MultipleRules    † 14.50 (1.92)  † 19.80 (4.29)  † 18.40 (3.06)  † 30.20 (3.39)  † 22.60 (2.91)  † 14.00 (1.29)
MultipleRulesPP  † 14.50 (1.92)  † 18.80 (4.08)  † 18.10 (2.98)  † 30.20 (3.39)  † 21.60 (2.71)  † 12.90 (1.22)
GA: ng = 40, ni = 15, ti = 10, Pc = 0.4, Pm = 0.01
*SingleRule      † 17.60 (1.33)  † 20.30 (5.08)  † 24.30 (2.03)  † 42.00 (3.71)  † 31.60 (2.09)  † 14.40 (0.95)
*SingleRulePP    †  7.60 (0.56)  †  5.70 (1.07)  † 10.80 (0.87)  † 19.60 (1.65)  † 11.80 (1.45)     6.10 (0.64)

- For the crx data set, the method C4.5rules obtained good results for the
number of induced rules and for the mean number of conditions per induced
rule. For the infidelity rate, this method obtained only one significant result.
However, the other methods were not able to obtain many significant results
either.

Table 7. Average number of conditions per induced rule (mean and standard
deviation).

                   breast          crx             heart           pima            sonar           votes
C4.5             †  3.63 (0.17)     2.62 (0.46)  †  3.25 (0.06)  †  5.66 (0.11)  †  4.27 (0.11)     2.68 (0.20)
C4.5rules           2.57 (0.05)     2.17 (0.10)     2.41 (0.04)     2.92 (0.05)  †  2.84 (0.13)     2.52 (0.11)
CN2                 2.56 (0.07)  †  2.84 (0.08)  †  2.92 (0.10)     3.02 (0.05)     2.03 (0.02)     2.64 (0.05)
simple0          †  3.54 (0.11)  †  2.78 (0.10)  †  3.05 (0.11)  †  5.58 (0.13)  †  3.93 (0.13)     2.83 (0.10)
simple1000       †  4.00 (0.19)     2.61 (0.27)  †  3.27 (0.12)  †  5.94 (0.17)  †  3.81 (0.24)     2.61 (0.18)
mofn0            †  6.74 (0.54)  †  7.62 (0.75)  †  9.47 (0.59)  † 10.23 (0.49)  †  9.35 (0.49)  †  5.46 (0.41)
mofn1000         †  7.49 (0.48)  † 10.10 (0.91)  † 10.17 (0.41)  † 10.73 (0.32)  † 23.41 (1.51)  †  9.04 (0.90)
GA: ng = 20, ni = 20, ti = 10, Pc = 0.25, Pm = 0.01
SingleRule          2.78 (0.11)  †  2.97 (0.14)     2.83 (0.08)     3.73 (0.16)     2.87 (0.09)     2.55 (0.11)
SingleRulePP        2.54 (0.08)     2.25 (0.27)     2.75 (0.12)     2.96 (0.09)     2.94 (0.14)     2.28 (0.16)
MultipleRules    †  2.79 (0.10)  †  3.00 (0.16)     2.89 (0.11)  †  3.79 (0.08)     2.91 (0.07)     2.54 (0.11)
MultipleRulesPP  †  2.79 (0.10)  †  2.96 (0.16)     2.88 (0.11)  †  3.79 (0.08)     2.95 (0.08)     2.47 (0.09)
GA: ng = 40, ni = 15, ti = 10, Pc = 0.4, Pm = 0.01
*SingleRule         2.65 (0.13)  †  2.86 (0.15)     2.78 (0.10)  †  3.83 (0.07)     2.94 (0.11)     2.59 (0.12)
*SingleRulePP       2.46 (0.08)     2.41 (0.20)     2.75 (0.08)     3.14 (0.09)     2.91 (0.05)     2.31 (0.08)

Table 8. Methods ordered by infidelity rate.

    breast             crx                heart              pima               sonar              votes
 1  SingleRule^3       C4.5^?             mofn1000^?         mofn1000^1         *SingleRulePP^?    simple1000
 2  *SingleRule^?      C4.5rules^1        *SingleRule^?      MultipleRules^1    SingleRule^?       CN2
 3  CN2^?              MultipleRules^?    C4.5rules^?        MultipleRulesPP^1  simple1000         SingleRule
 4  MultipleRules^?    MultipleRulesPP^?  *SingleRulePP      *SingleRulePP      CN2                mofn1000
 5  MultipleRulesPP^?  simple0^?          SingleRule^?       C4.5rules^1        C4.5               C4.5rules
 6  SingleRulePP^1     SingleRule         mofn0              *SingleRule        *SingleRule        *SingleRule
 7  C4.5rules          *SingleRule        simple1000         SingleRule         C4.5rules          MultipleRulesPP
 8  *SingleRulePP^?    mofn0^?            SingleRulePP       C4.5               simple0            MultipleRules
 9  mofn0              SingleRulePP       MultipleRules      simple1000         SingleRulePP       mofn0
10  C4.5               mofn1000           MultipleRulesPP    CN2                MultipleRulesPP    C4.5
11  mofn1000           *SingleRulePP      CN2                SingleRulePP       MultipleRules      simple0
12  simple0            simple1000         C4.5               mofn0              mofn0              SingleRulePP
13  simple1000         CN2                simple0            simple0            mofn1000           *SingleRulePP

- For the heart data set, the methods *SingleRule and mofn1000 obtained
the best results for the infidelity rate. However, for the number of induced
rules, the method *SingleRule did not obtain good results, and the method
mofn1000 did not obtain good results for the mean number of conditions per
induced rule.

- For the pima data set, the methods *SingleRulePP and C4.5rules obtained
good results for the syntactic complexity. For the infidelity rate, the
maximum number of significant results obtained by any method was 1, thus none
of them can be considered the best.

- For the sonar data set, the method *SingleRulePP obtained very good
results for the infidelity rate and the number of induced rules.

- For the votes data set, none of the methods presented significant results for
the infidelity rate. However, the methods SingleRulePP, *SingleRulePP and
C4.5rules presented good results for the syntactic complexity.

Table 9. Methods ordered by number of induced rules.

    breast             crx                heart              pima               sonar              votes
 1  mofn1000^?         SingleRulePP^?     mofn1000^?         SingleRulePP^?     mofn1000^?         SingleRulePP^?
 2  SingleRulePP^8     *SingleRulePP^?    SingleRulePP^?     C4.5rules^?        mofn0^?            C4.5rules^?
 3  mofn0^?            C4.5rules^?        mofn0^?            SingleRule^?       SingleRulePP^?     *SingleRulePP^?
 4  *SingleRulePP^?    mofn1000^?         *SingleRulePP^?    *SingleRulePP^?    C4.5rules^?        mofn1000^?
 5  C4.5rules^?        mofn0^?            C4.5rules^?        mofn0^?            simple1000^?       mofn0^?
 6  C4.5^?             CN2^?              CN2                CN2^?              *SingleRulePP^?    C4.5^?
 7  simple0^?          C4.5^?             C4.5               C4.5               simple0^?          simple1000
 8  simple1000^?       SingleRule^?       simple0            mofn1000^?         C4.5^?             MultipleRulesPP^?
 9  CN2                MultipleRulesPP^?  MultipleRulesPP    simple0^?          SingleRule^?       CN2
10  MultipleRules      MultipleRules^?    MultipleRules      simple1000         MultipleRulesPP^?  simple0
11  MultipleRulesPP    *SingleRule^?      simple1000         MultipleRules      MultipleRules      MultipleRules
12  SingleRule         simple1000         SingleRule         MultipleRulesPP    CN2                *SingleRule
13  *SingleRule        simple0            *SingleRule        *SingleRule        *SingleRule        SingleRule

Table 10. Methods ordered by average number of conditions per induced rule.

    breast             crx                heart              pima               sonar              votes
 1  *SingleRulePP^?    C4.5rules^?        C4.5rules^?        C4.5rules^?        CN2^?              SingleRulePP^?
 2  SingleRulePP^7     SingleRulePP^?     *SingleRulePP^?    SingleRulePP^?     C4.5rules^?        *SingleRulePP^?
 3  CN2^?              *SingleRulePP^?    SingleRulePP^?     CN2^?              SingleRule^?       MultipleRulesPP^?
 4  C4.5rules^?        simple1000^?       *SingleRule^?      *SingleRulePP^?    MultipleRules^?    C4.5rules^?
 5  *SingleRule^?      C4.5^?             SingleRule^?       SingleRule^?       *SingleRulePP^?    MultipleRules^?
 6  SingleRule^?       simple0^?          MultipleRulesPP^?  MultipleRules^?    *SingleRule^?      SingleRule^?
 7  MultipleRules^?    CN2^?              MultipleRules^?    MultipleRulesPP^?  SingleRulePP^?     *SingleRule^?
 8  MultipleRulesPP^?  *SingleRule^?      CN2^?              *SingleRule^?      MultipleRulesPP^?  simple1000^?
 9  simple0^?          MultipleRulesPP^?  simple0^?          simple0^?          simple1000^?       simple0^?
10  C4.5^?             SingleRule^?       C4.5^?             C4.5^?             simple0^?          CN2^?
11  simple1000^?       MultipleRules^?    simple1000^?       simple1000^?       C4.5^?             C4.5^?
12  mofn0              mofn0^?            mofn0              mofn0              mofn0^?            mofn0^?
13  mofn1000           mofn1000           mofn1000           mofn1000           mofn1000           mofn1000


Finally, analyzing the results in a general way, it is possible to conclude that
the use of GAs for extracting comprehensible knowledge from ANNs is promising
and should be explored in more depth. One of the aspects that should be better
investigated is the good results obtained by the strategy SingleRule
compared to the strategy MultipleRules. Apparently, a GA builds better classifiers
when it considers only the best rule to classify a new instance, instead of
considering all the rules that cover an instance and deciding the classification of
this instance based on the global quality of these rules.



4 Conclusion 



In this work, we propose a method based on symbolic ML systems and GAs
to extract comprehensible knowledge from ANNs. The main advantage of
the proposed method is that it can be applied to any supervised ANN. The use
of GAs allows the integration of the knowledge extracted by several symbolic ML
systems into a single set of rules. This set of rules should have high fidelity to
the ANN, because it will be used to explain the ANN.
This task is not trivial since, in order to obtain a high degree of fidelity, the rules
have to complement one another. In the experiments carried out, the proposed
method achieved satisfactory results and should be further explored.

Several ideas for future work will be evaluated, such as:

- More symbolic ML systems can be used to increase the diversity of rules.
Other strategies are to vary the parameters of these systems and to induce
classifiers on different samples.

- In the current implementation, we chose to generate the individuals of the GA
by randomly selecting rules from the whole set of classifiers. As the individuals of
the initial population are not necessarily “good” classifiers, there is a higher
chance of stopping in a local optimum. In this work, we opted for building
the initial population randomly, with the objective of verifying the potential
of the proposed method without favoring the GA. We intend to investigate
the behavior of the GA when each initial individual of the population is a
“good” classifier. This can be accomplished by using the induced classifiers
as initial individuals.

- Another important aspect regarding the set of generated rules is that this
set should cover a substantial region of the instance space. In other words,
it is of little use to have a set of rules in which each individual rule is highly
faithful to the ANN if all these rules cover the same instances. If a set of rules
has several redundant rules, then several instances might be classified by the
default rule. In the current implementation, the default rule classifies every
instance as belonging to the majority class. A strategy to create individuals
with complementary covering rules is to introduce this information in the
GA fitness function.

Abstract. Novelty detection in time series is an important problem
with applications in different domains such as machine failure detection,
fraud detection and auditing. In many problems, the occurrence of short
time series is a frequent characteristic. In previous works we have
proposed a novelty detection approach for short time series that uses
RBF neural networks to classify time series windows as normal or novelty.
Additionally, both normal and novelty random patterns are added
to training sets to improve classification performance. In this work we
consider the use of MLP networks as classifiers. Next, we analyze the
impact of (a) the generation of validation and training sets, and (b) the
training method. We have carried out a number of experiments using four
real-world time series, whose results have shown that under a good selection of
these alternatives, MLPs perform better than RBFs. Finally, we discuss
the use of MLP and MLP/RBF committee machines in conjunction with
our previous method. Experimental results show that these committee
classifiers outperform single MLP and RBF classifiers.



1 Introduction 

Novelty detection - the process of finding novel patterns in data - is very im- 
portant in several domains such as computer vision, machine fault detection, 
network security and fraud detection [14,5]. A novelty detection system can be 
regarded as a classifier with two possible outcomes, one for normal and the other 
for novelty patterns. However, in most cases, there is only normal data available 
to train the classifier [14,5]. Hence, novelty detection systems must be properly 
designed to overcome this problem. 

The behavior of many systems can be modeled by time series. Recently, the 
problem of detecting novelties in time series has received great attention, with a 
number of different techniques being proposed and studied, including techniques 
based on time series forecasting with neural networks [8,9], artificial immune 
system [5], wavelets [13] and Markov models [7]. These techniques have been 
applied in areas such as machine failure detection [5] and auditing [8,9]. 

Forecasting-based time series novelty detection has been criticized because
of its relatively poor performance [7,5]. Alternatively, a number of
classification-based approaches have been recently proposed [5,13,7]. However, none
of them is devoted to short time series. This has motivated us to design a method
devoted to short time series novelty detection [10]. Our method is designed to classify
time series windows as normal or novelty. It is based on the negative samples
approach, which consists of generating artificial samples to be used to represent
novelty [14,5]. In order to improve classification performance on test sets, our
method also adds artificial normal samples to the training sets. We have used
RBF neural networks trained with the DDA algorithm as our classifier [2].

In this work we study how the use of different classifiers impacts the system
performance. The classifiers considered are Multi-Layer Perceptron neural
networks (MLPs), RBFs, committees of MLPs and committees of MLP and RBF
networks. Training MLPs requires the use of some method to avoid overfitting.
We consider two approaches: early stopping [6] and weight decay with Bayesian
regularization [4,6]. In order to use early stopping it is necessary to further divide
the data available for training into training and validation sets. We also evaluate
two ways of creating training and validation sets from time series.

The next section presents the proposed approach and the alternative methods
to generate training and validation sets. Section 3 presents the classification
algorithms considered in this work. Section 4 describes experiments carried out
with real-world time series in order to compare the performance of the
classifiers. Finally, Section 5 presents conclusions and suggestions for further
research.

2 The Proposed Approach 

Our novelty detection system works by classifying time series windows as normal
or novelty [10]. The system requires fixed length windows, with window size w.
A window is formed by w consecutive datapoints extracted from the time series
under analysis. The first training pattern will have the first w datapoints from
the time series as its attribute values. To obtain the second pattern we start with
the second datapoint and use the next w datapoints. The remaining patterns are
obtained by sliding the window by one and taking the next w datapoints. So,
if we have a time series with l datapoints and use window size w, we will have
l - w + 1 patterns. These patterns will later be separated to obtain training and
test sets.
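
The window extraction just described can be sketched in a few lines. The function name is hypothetical and the series values are made up for illustration.

```python
def sliding_windows(series, w):
    """Return every run of w consecutive datapoints, sliding by one position."""
    return [series[i:i + w] for i in range(len(series) - w + 1)]

series = [10, 12, 11, 13, 15, 14]
patterns = sliding_windows(series, w=4)
print(len(patterns))  # 3, i.e. 6 - 4 + 1
print(patterns[1])    # [12, 11, 13, 15]
```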

Given a window from the time series, the idea is to define an envelope around
it, as shown in figure 1. Any time series window with all values inside the envelope
is considered normal. Windows with points outside the envelope are considered
novelty. We use a threshold p1 to define the envelope. Normal patterns are
defined by establishing the maximum percent deviation p1 above and below each
datapoint of a given original pattern.

Fig. 1. Definition of normal and novelty regions.

We suppose that the time series represents the normal behaviour, so the
training set will have only normal patterns. Thus, in order to train a classifier for
the novelty detection task, we need to generate novelty random patterns. These
patterns are windows with values in the novelty regions shown in figure 1. We
need to generate sufficient random patterns for each window in the series in
order to represent the novelty space adequately. We also generate a number of
normal random patterns whose datapoints are inside the envelope. In order to
improve classification performance, random patterns should be added in such a
way that the resulting data set has equal numbers of normal and novelty patterns
[6]. The training set with the original patterns and the random normal and
random novelty patterns is called the augmented training set.

In many problems of interest, such as auditing, where novelty is related to
the possibility of fraud, we are mainly interested in testing network performance
in the detection of patterns whose deviation from normality is not too large. So,
we also define a second threshold p2 to limit the novelty regions. For example,
in the experiments presented below, we use p1 = 0.1 and p2 = 0.5, meaning
that patterns whose attributes deviate at most 10% from the normal pattern are
considered normal and patterns whose attributes deviate from a normal pattern
by 10% to 50% are considered novelty or fraudulent patterns. After generating
the augmented training set, we train a classifier to discriminate normal and
novelty windows in the time series.
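
A possible way of building the augmented training set is sketched below. The function names are hypothetical. Two assumptions are made for illustration: every datapoint of a novelty pattern is placed in the novelty region (between p1 and p2 away from the original value, above or below), and each original window receives n random normal and n + 1 random novelty patterns so that the two classes end up balanced. Percent deviations assume non-zero datapoints.

```python
import random

def random_normal_pattern(window, p1, rng):
    """Every point stays within p1 (as a fraction) of the original value."""
    return [x * (1.0 + rng.uniform(-p1, p1)) for x in window]

def random_novelty_pattern(window, p1, p2, rng):
    """Assumption: every point deviates by between p1 and p2, above or below."""
    pattern = []
    for x in window:
        deviation = rng.uniform(p1, p2) * rng.choice((-1.0, 1.0))
        pattern.append(x * (1.0 + deviation))
    return pattern

def inside_envelope(candidate, window, p1, tolerance=1e-9):
    """True when all points lie within the envelope around the window."""
    return all(abs(c - x) <= p1 * abs(x) * (1.0 + tolerance)
               for c, x in zip(candidate, window))

def augment(windows, n, p1=0.1, p2=0.5, seed=0):
    """Return (pattern, label) pairs: originals, n normal and n + 1 novelty each."""
    rng = random.Random(seed)
    data = []
    for window in windows:
        data.append((list(window), "normal"))
        data.extend((random_normal_pattern(window, p1, rng), "normal")
                    for _ in range(n))
        data.extend((random_novelty_pattern(window, p1, p2, rng), "novelty")
                    for _ in range(n + 1))
    return data

windows = [[100.0, 120.0, 110.0], [120.0, 110.0, 130.0], [110.0, 130.0, 150.0]]
augmented = augment(windows, n=2)
labels = [label for _, label in augmented]
print(labels.count("normal"), labels.count("novelty"))  # 9 9
```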



2.1 Generation of Training and Validation Sets 

Early stopping is a common method used to avoid overfitting in neural network
training [6], especially with MLP neural networks. In this technique, the data
available for training is further divided into training and validation sets. In
time series forecasting it is common to separate data into training, validation
and test sets using the natural time order [15,6]. If we used this approach in
our system, the augmented training, validation and test sets would be generated
from the respective training, validation and test sets.
We argue that dividing data in time order means losing important information
for training. We propose a new form of division that we call distributed
division. In distributed division the original time series is also divided into
training, validation and test sets in time order. However, a number of the
random patterns generated from the training set are added to form the augmented
validation set and vice versa. In this way, the resulting augmented training and
validation sets will have information from the whole period available for
training.

Generation of augmented training and validation sets in distributed division
works as follows:

1. Divide the original time series into disjoint training and test periods in
the natural time order;

2. For each set, generate normal patterns of window length w, according to the
procedure previously stated;

3. For each normal pattern available in the training set, generate n normal
random patterns, according to the criterion previously stated. Put a percentage
of these patterns in the augmented training set and the remaining patterns in
the augmented validation set;

4. For each normal pattern available in the training set, generate n + 1
novelty random patterns, according to the criterion previously stated. Put a
percentage of these patterns in the augmented training set and the rest in the
augmented validation set.
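
Steps 3 and 4 can be sketched as below. The text only says that "a percentage"
of the random patterns goes to each set, so the per-window allocation
(`val_per_window`), the decision to keep all original patterns in the training
set, and the helper `toy_generate` are assumptions made for illustration.

```python
import random

def distributed_division(train_windows, generate, n, val_per_window, seed=0):
    """Build augmented training and validation sets from the training period.

    `generate(window, kind, rng)` is a hypothetical callback that returns one
    random pattern of the given kind ("normal" or "novelty").
    """
    rng = random.Random(seed)
    aug_train, aug_val = [], []
    for window in train_windows:
        # Assumption: the original pattern itself always stays in training.
        aug_train.append((list(window), "normal"))
        randoms = [(generate(window, "normal", rng), "normal") for _ in range(n)]
        randoms += [(generate(window, "novelty", rng), "novelty")
                    for _ in range(n + 1)]
        rng.shuffle(randoms)
        aug_val.extend(randoms[:val_per_window])    # share sent to validation
        aug_train.extend(randoms[val_per_window:])  # remainder kept for training
    return aug_train, aug_val

def toy_generate(window, kind, rng):
    """Illustrative generator: relative deviations of 0-10% or 10-50%."""
    lo, hi = (0.0, 0.1) if kind == "normal" else (0.1, 0.5)
    return [x * (1 + rng.choice((-1, 1)) * rng.uniform(lo, hi)) for x in window]

windows = [[1.0] * 12 for _ in range(60)]
aug_train, aug_val = distributed_division(windows, toy_generate, n=9,
                                          val_per_window=4)
```

Under these assumptions, 60 training windows with n = 9 give 960 training and
240 validation patterns, and every window of the training period is
represented in both sets.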

3 The Classification Algorithms 

This work aims at studying a number of alternative classifiers for the proposed
short time series novelty detection system. In a previous paper we used RBFs
trained with the DDA algorithm [10]. In this work we consider Multi-Layer
Perceptron (MLP) neural networks and committee machines formed either by MLPs
only or by MLP and RBF-DDA. The number of inputs for each classifier
corresponds to the window size w. Each classifier has two outputs, one used to
indicate normal patterns and the other, novelties. All networks have a single
hidden layer.

3.1 Radial Basis Function Networks (RBFs)

The DDA algorithm is a constructive algorithm used to build and train RBF
networks [2]. It does not need a validation set, so all training data can be
used more effectively for training. RBF-DDA has often achieved classification
accuracy comparable to MLPs trained with Rprop [12], but training is
significantly faster.

3.2 Multi-layer Perceptrons (MLPs)

The second kind of classifier considered in this work is the Multi-Layer
Perceptron (MLP) neural network. We have used four alternative MLP classifiers:
1) MLP trained with Rprop using time order validation sets; 2) MLP trained with
Rprop using distributed validation sets; 3) MLP trained with RpropMAP using
distributed validation sets; 4) a committee machine with MLPs trained with
RpropMAP.

The first two MLP classifiers are trained with resilient backpropagation
(Rprop) [12] and early stopping with the GL$_5$ criterion from Proben1 [11].

The third MLP classifier is trained with Rprop with adaptive weight decay
(RpropMAP) [4]. This is an extended version of the Rprop algorithm that uses
regularization by adaptive weight decay [6]. The weighting parameter $\lambda$
for the weight-decay regularizer is computed automatically within the Bayesian
framework during training. RpropMAP has two parameters besides those of Rprop:
the initial weighting $\lambda$ of the weight-decay regularizer and the update
frequency of the weighting parameter. In our experiments we have used the
default values for these parameters, which are 1 and 50, respectively.

RpropMAP does not need a validation set, which is a very important advantage
for our problem. All data available for training are effectively used to adjust
the network weights. Training is carried out in two phases. The first phase
uses a validation set in order to discover the number of epochs to be used in
the second phase. The second phase uses the full training set. In the
experiments we start by dividing the training data into training and validation
sets using distributed validation. Next, we randomly initialize the network
weights and biases and train the network using RpropMAP and early stopping with
the GL$_5$ criterion. At the end of training, we take note of the number of
epochs needed, $e_{max}$. Finally, we use the same random weight and bias
initialization and train the network again, this time with the full training
set. Training stops when the number of epochs reaches $e_{max}$.
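
The first phase of this procedure relies on an early-stopping rule. The sketch
below implements the generalization loss (GL) rule as defined in Proben1, where
training stops once the validation error exceeds the best validation error seen
so far by more than a threshold of `alpha` percent. The function name, the
list-based interface and the default `alpha=5.0` are illustrative assumptions;
the epoch it returns plays the role of the epoch budget reused in the second
phase.

```python
def gl_stopping_epoch(val_errors, alpha=5.0):
    """Return the 1-based epoch at which GL first exceeds `alpha` percent.

    val_errors[t-1] is the validation error after epoch t. If the criterion
    is never met, the last epoch is returned.
    """
    best = float("inf")
    for epoch, err in enumerate(val_errors, start=1):
        best = min(best, err)
        if best > 0:
            # Generalization loss: relative increase over the best error so far.
            gl = 100.0 * (err / best - 1.0)
            if gl > alpha:
                return epoch
    return len(val_errors)

# Phase 1: train with a validation set and record the stopping epoch.
epochs_needed = gl_stopping_epoch([0.50, 0.40, 0.35, 0.36, 0.38])
# Phase 2 (not shown): retrain from the same initial weights on the full
# training set for exactly `epochs_needed` epochs.
```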

The fourth MLP classifier is a committee machine of MLP classifiers.
Experiments with the third classifier are carried out ten times with different
random initializations of weights and biases, and we then obtain the mean and
standard deviation of the classification error across executions. The fourth
MLP classifier, on the other hand, uses the ensemble mean to compute the
classification error [6]. In this case we have, for each augmented training
set, ten MLPs with two output neurons trained with RpropMAP, as described
previously. For each MLP j, the value of output i is given by $y_{ij}$. Given a
pattern p in the test set, we compute output i of the committee classifier as
$y_i = \frac{1}{10} \sum_{j=1}^{10} y_{ij}$. Next, pattern p is classified
according to the winner-takes-all criterion [6]. This is done for each pattern
in the test set, in order to compute the classification error on this set.
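
The ensemble mean followed by winner-takes-all can be illustrated with a few
lines of code. This is only a sketch for a single test pattern; the function
name and the convention that output index 0 means "normal" and index 1 means
"novelty" are assumptions.

```python
def committee_classify(member_outputs):
    """Average the outputs of the committee members and pick the winner.

    member_outputs[j][i] is output i of network j for one test pattern.
    Returns (mean_outputs, winning_index).
    """
    n_members = len(member_outputs)
    n_outputs = len(member_outputs[0])
    mean = [sum(m[i] for m in member_outputs) / n_members
            for i in range(n_outputs)]
    # Winner-takes-all: the class whose averaged output is largest.
    winner = max(range(n_outputs), key=lambda i: mean[i])
    return mean, winner

# Three members shown for brevity; the paper uses ten MLPs per committee.
mean, winner = committee_classify([[0.9, 0.2], [0.4, 0.6], [0.7, 0.1]])
```

Here the second member alone would vote for class 1, but the averaged outputs
favour class 0, which shows how the committee smooths out the effect of one
unlucky weight initialization.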



3.3 MLP/RBF Committee 

MLP and RBF networks have different characteristics, mainly due to their
different activation functions [6]. The experiments carried out in this work
show that these networks perform differently regarding false positives and
false negatives. This has motivated us to propose the use of an MLP/RBF
committee in order to further improve the performance of the novelty detection
system. The committee uses an RBF trained with DDA and MLPs trained with
RpropMAP.

For each augmented training set we need to train the RBF only once because
DDA does not depend on weight initialization. On the other hand, we need
ten MLPs for each augmented training set. RBF-DDA outputs can be greater
than 1. MLPs using logistic sigmoid activation functions produce outputs from
0 to 1. So, in order to integrate the information from these classifiers we
must transform their results. We use the softmax transformation for this
purpose [6]. Firstly, we combine the results of the MLPs trained with RpropMAP
to form the MLP committee described previously. Next, we apply the softmax
transformation $g_i = e^{y_i} / \sum_k e^{y_k}$ to each committee output i. The
softmax transformation is also applied independently to the outputs of the
trained RBF-DDA. Finally, the softmax-transformed outputs of the MLP committee
and of the RBF are combined to obtain an ensemble mean. The winner-takes-all
criterion is applied for classification.
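
A minimal sketch of this combination for one test pattern is given below. It
assumes the standard softmax definition and an unweighted mean of the two
transformed output vectors; the function names are hypothetical.

```python
import math

def softmax(outputs):
    """Map raw outputs to positive values that sum to 1."""
    shift = max(outputs)  # subtracting the maximum avoids overflow
    exps = [math.exp(y - shift) for y in outputs]
    total = sum(exps)
    return [e / total for e in exps]

def mlp_rbf_committee(mlp_committee_outputs, rbf_outputs):
    """Combine the MLP committee and the RBF-DDA outputs for one pattern."""
    g_mlp = softmax(mlp_committee_outputs)
    g_rbf = softmax(rbf_outputs)
    combined = [(a + b) / 2 for a, b in zip(g_mlp, g_rbf)]
    # Winner-takes-all on the ensemble mean.
    winner = max(range(len(combined)), key=lambda i: combined[i])
    return combined, winner

# MLP outputs lie in [0, 1]; RBF-DDA outputs may exceed 1.
combined, winner = mlp_rbf_committee([0.6, 0.4], [0.2, 2.5])
```

After the transformation both classifiers contribute values on the same scale,
so an RBF-DDA output greater than 1 cannot dominate the mean merely because of
its range.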



4 Experiments 

We performed some experiments using real-world data in order to compare the
performance of the different classifiers in our novelty detection approach. The
time series used in the experiments were used in our previous work with
RBF-DDA as the classifier [10]. The series are shown in figure 2. All have 84
values corresponding to the months from January 1996 to December 2002. The
first series was extracted from a real payroll and was used previously in a
study of a payroll auditing system based on neural network forecasting [9]. The
remaining series are sales time series with values in the same period as the
first series. Each series had its values normalized between 0 and 1.

It is clear that these series are non-stationary. So, in order to use a focused
TLFN (Time Lagged Feedforward Network), it is important to pre-process the
time series in order to work with their stationary versions [6]. We have used
the classic technique of differencing to obtain stationary versions of the time
series [3]. For each original time series $\{x_1, \ldots, x_N\}$ a differenced
time series $\{y_2, \ldots, y_N\}$ was formed by $y_t = x_t - x_{t-1}$. Note
that the differenced time series does not have the first datapoint. In fact, we
have shown in our previous work that differencing the time series has a great
influence on novelty detection with RBF-DDA [10]. Hence, in this study we
consider only differenced versions of the time series.
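
First-order differencing is simple enough to show in full. The function name
below is an assumption; the computation follows the definition given above.

```python
def difference(series):
    """Return the differenced series y_t = x_t - x_{t-1}, for t = 2, ..., N."""
    return [current - previous
            for previous, current in zip(series, series[1:])]

monthly_values = [10.0, 12.0, 11.0, 15.0]
print(difference(monthly_values))  # prints [2.0, -1.0, 4.0]
```

A series with 84 monthly values therefore yields 83 differenced values.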

We have used a window size w = 12. The patterns are generated from each
time series according to the procedure described in section 2. For differenced
time series we will have 72 patterns with 12 attributes each (the 83
differenced values give 83 - 12 + 1 = 72 windows). We have used the last 12
patterns as the test set and the remaining patterns as the training set. It is
important to emphasize that normalization is carried out only after the
generation of the random patterns.
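
The pattern counts quoted above can be checked with a short sketch. It assumes
that windows slide one position at a time, which is consistent with the 72
patterns reported; the function name and the placeholder data are hypothetical.

```python
def sliding_windows(series, w):
    """Return all windows of length w, sliding one position at a time."""
    return [series[i:i + w] for i in range(len(series) - w + 1)]

differenced = [0.0] * 83  # placeholder for an 83-point differenced series
patterns = sliding_windows(differenced, 12)
train_patterns, test_patterns = patterns[:-12], patterns[-12:]
print(len(patterns), len(train_patterns), len(test_patterns))  # prints 72 60 12
```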

The normal and novelty regions were defined using thresholds $p_1 = 0.1$ and
$p_2 = 0.5$. For each original pattern we have added 9 normal random patterns
and 10 random novelty patterns. With this, our training and test sets increase
by a factor of 20, having, respectively, 1200 and 240 patterns for differenced
time series. For classifiers that need a validation set, we further divide the
training data into training and validation sets using either time order or
distributed validation, according to section 2. In both cases, the training set
will have 80% of the patterns (960) and the validation set will have 20% of
them (240 patterns).

Fig. 2. Time series used in the experiments: (a) Series 1: earnings from a
Brazilian payroll; (b) Series 2: sales from a Brazilian company; (c) Series 3:
USA computer and software stores sales; (d) Series 4: USA hardware stores
sales. Series 3 and 4 are available at the URL
http://www.census.gov/mrts/www/mrts.html.

4.1 Network Topologies and Training

The number of hidden units in the RBF classifier is obtained automatically
during training by the DDA algorithm [2]. We have used MLPs with one hidden
layer. The number of units in the hidden layer has a great impact on classifier
performance. For MLPs, experiments were carried out with 2, 6, 12, 18, 24, 36
and 48 units in the hidden layer. We present only the results corresponding to
the topology that performed best.

For each time series we have trained the network with ten different versions of
the augmented training set generated from different seeds, to take into account
the variability of the random patterns added to form them. The DDA algorithm
for RBF training is constructive and its result does not depend on weight
initialization. Thus, for each augmented training set, experiments are executed
only once. On the other hand, MLP performance depends on the weight and bias
initialization; therefore, in this case, for each augmented training set, we
train and test the network ten times, to take into account the influence of
weight initialization on the results.

4.2 Results and Discussion 

Table 1 presents the results obtained after training the networks for series 1,
2, 3, and 4. It contains the mean number of epochs used in training, the mean
number of hidden units in the resulting network and the network performance on
its test set, i.e., the classification error, the false alarm rate and the
undetected novelty rate. A false alarm happens when the network classifies a
normal pattern as novelty. An undetected novelty happens when a novelty pattern
is misclassified.
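
The three performance measures can be computed as in the sketch below. In
Table 1 the false alarm rate and the undetected novelty rate add up to the
classification error, which suggests that both are expressed relative to the
total number of test patterns; the code adopts that reading as an assumption.
The label strings and the function name are hypothetical.

```python
def error_rates(true_labels, predicted_labels):
    """Return the three test-set measures as fractions of all patterns."""
    total = len(true_labels)
    pairs = list(zip(true_labels, predicted_labels))
    # False alarm: a normal pattern classified as novelty.
    false_alarms = sum(1 for t, p in pairs
                       if t == "normal" and p == "novelty")
    # Undetected novelty: a novelty pattern classified as normal.
    undetected = sum(1 for t, p in pairs
                     if t == "novelty" and p == "normal")
    return {
        "false_alarm_rate": false_alarms / total,
        "undetected_novelty_rate": undetected / total,
        "classification_error": (false_alarms + undetected) / total,
    }

truth = ["normal"] * 4 + ["novelty"] * 4
guess = ["normal", "normal", "novelty", "normal",
         "novelty", "normal", "novelty", "novelty"]
rates = error_rates(truth, guess)
```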

In table 1, MLP(1) means MLP trained with Rprop using time order validation
sets; MLP(2) means MLP trained with Rprop using our distributed validation
sets generation; MLP(3) means MLP trained with RpropMAP; MLP(4) means MLP
committee with ensemble mean; MLP/RBF is the classifier built by combining MLP
and RBF results. Finally, we also present our former RBF-DDA results for
comparison [10].

Results show that the use of the proposed distributed validation set approach
improves MLP performance on test sets for all series considered. It can also be
seen that using all data available for training to form the training set and
training with RpropMAP greatly improves performance. Results also show that the
committee of MLPs and the MLP/RBF committee further improve classification
performance on test sets, for all time series considered. These committee
machines outperform both MLP and RBF-DDA. RBF-DDA obtained 8.35% mean
classification error across the four series, while the MLP committee and the
MLP/RBF committee obtained, respectively, 2.24% and 3.03%.
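
The means quoted above can be reproduced from the per-series classification
errors in Table 1, taking an unweighted average over the four series:

```latex
\begin{align*}
\bar{E}_{\text{RBF-DDA}} &= \tfrac{1}{4}\,(11.00 + 6.50 + 8.79 + 7.13)\,\%
  = \tfrac{33.42}{4}\,\% = 8.355\,\% \quad (\text{reported as } 8.35\,\%) \\
\bar{E}_{\text{MLP committee}} &= \tfrac{1}{4}\,(0.04 + 3.58 + 4.54 + 0.79)\,\%
  = \tfrac{8.95}{4}\,\% \approx 2.24\,\% \\
\bar{E}_{\text{MLP/RBF}} &= \tfrac{1}{4}\,(3.25 + 3.08 + 3.04 + 2.75)\,\%
  = \tfrac{12.12}{4}\,\% = 3.03\,\%
\end{align*}
```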

The MLP and MLP/RBF committees have produced comparable results for the
classification error. The MLP committee outperformed the MLP/RBF committee in
two series and produced worse results for the remaining two series. However,
these classifiers produced very different results regarding false alarm and
undetected novelty rates. RBF-DDA and the MLP/RBF committee have produced
0% false alarm rates in all experiments. On the other hand, pure MLP
classifiers tend to produce false alarm rates greater than undetected novelty
rates. This is due to the different nature of MLPs and RBFs. MLPs build global
input-output mappings while RBFs build local mappings [6].

Table 1. Performance of the novelty detection approach on test sets for each
time series

Classifier Epochs     Hidden    Class. error      False alarm rate  Undetec. novelty rate
                      units     mean     s.dev    mean     s.dev    mean     s.dev

Series 1
MLP (1)    224.10     6         30.08%   3.53%    21.52%   2.75%    8.55%    2.09%
MLP (2)    584.35     24        10.10%   1.68%    8.60%    2.08%    1.50%    1.44%
MLP (3)    3552.65    48        1.37%    0.56%    0.91%    0.44%    0.46%    0.24%
MLP (4)    3552.65    48        0.04%    0.13%    0.00%    0.00%    0.04%    0.13%
MLP/RBF    3552.65/4  48/623.2  3.25%    1.33%    0.00%    0.00%    3.25%    1.33%
RBF-DDA    4          623.2     11.00%   2.86%    0.00%    0.00%    11.00%   2.86%

Series 2
MLP (1)    150.10     12        33.17%   1.57%    22.86%   1.96%    10.31%   1.93%
MLP (2)    988.75     12        22.41%   1.56%    19.80%   2.01%    2.61%    0.97%
MLP (3)    7033.55    36        7.09%    2.26%    6.44%    2.21%    0.65%    0.34%
MLP (4)    7033.55    36        3.58%    2.42%    3.37%    2.34%    0.21%    0.41%
MLP/RBF    7033.55/4  36/621.7  3.08%    1.27%    0.00%    0.00%    3.08%    1.27%
RBF-DDA    4          621.7     6.50%    1.86%    0.00%    0.00%    6.50%    1.86%

Series 3
MLP (1)    450.55     12        29.90%   1.41%    25.24%   1.95%    4.66%    0.90%
MLP (2)    710.95     24        29.26%   3.00%    26.65%   3.28%    2.61%    0.92%
MLP (3)    7869.75    48        8.35%    2.88%    6.78%    2.74%    1.57%    0.38%
MLP (4)    7869.75    48        4.54%    3.29%    3.71%    2.90%    0.83%    0.52%
MLP/RBF    7869.75/4  48/656.1  3.04%    0.92%    0.00%    0.00%    3.04%    0.92%
RBF-DDA    4          656.1     8.79%    1.93%    0.00%    0.00%    8.79%    1.93%

Series 4
MLP (1)    205.00     48        20.36%   3.22%    13.43%   1.86%    6.93%    2.53%
MLP (2)    640.45     48        11.46%   2.71%    8.48%    2.87%    2.98%    1.03%
MLP (3)    6948.30    36        2.09%    0.58%    1.02%    0.47%    1.07%    0.45%
MLP (4)    6948.30    36        0.79%    0.69%    0.21%    0.35%    0.58%    0.53%
MLP/RBF    6948.30/4  36/644.4  2.75%    1.09%    0.00%    0.00%    2.75%    1.09%
RBF-DDA    4          644.4     7.13%    1.32%    0.00%    0.00%    7.13%    1.32%

5 Conclusions

In this work we have analyzed a number of alternative classifiers to be used
in conjunction with our method for novelty detection in short time series [10].
Our method works by generating both novelty and normal random patterns and
adding them to the training sets in order to improve classification
performance. In this paper we compare the performance of RBF-DDA, MLPs and
committee machines of these classifiers. The classifiers were compared using
four real-world non-stationary time series, and the experiments have shown that
the committee machines formed by MLPs and the MLP/RBF committee machines
achieve similar performance, with 2.24% and 3.03% mean classification errors,
respectively. This represents a considerable improvement over RBF-DDA, which
produced a mean classification error of 8.35% across these series [10].

The classifiers have been compared using cyclic non-stationary time series,
which appear in many important problems, such as auditing [10]. However, we
are aware of the importance of assessing system performance on other kinds
of time series. Our future work includes studying the impact of the window
size on the system's performance and applying neural networks with more
powerful temporal processing abilities such as TDRBF [1], TDNN, FIR and
recurrent networks [6]. They will be used to classify non-stationary time
series directly and will be applied in conjunction with the method proposed
here to real auditing problems, such as accountancy auditing [8] and payroll
auditing [9].

Abstract. This work belongs to the field of hybrid systems for Artificial
Intelligence (AI). It concerns the study of "gradual" rules, which make it
possible to represent correlations and modulation relations between variables.
We propose a set of characteristics to identify these gradual rules, and a
classification of these rules into "direct" rules and "modulation" rules. In
neurobiology, pre-synaptic neuronal connections lead to gradual processing and
modulation of cognitive information. Taking such neurobiological data as a
starting point, we propose, in the field of connectionism, the use of
"Sigma-Pi" connections to allow gradual processing in AI systems. In order to
represent the modulation processes between the inputs of a network as well as
possible, we have created a new type of connection, "Asymmetric Sigma-Pi"
(ASP) units. These models have been implemented within a pre-existing hybrid
neuro-symbolic system, the INSS system, based on connectionist nets of the
"Cascade Correlation" type. The new hybrid system thus obtained, INSS-Gradual,
allows the learning of bases of examples containing gradual modulation
relations. ASP units facilitate the extraction of gradual rules from a
neural network.



1 Introduction 

Artificial intelligence (AI) was concerned very early with the concept of
learning, which is the basis of all experience, with the goal of developing
methods that allow machines to acquire expert knowledge. Two main approaches
have been explored and then exploited, leading to knowledge systems of very
different natures: the symbolic approach underlying classic AI and, more
recently, connectionist methods based on artificial neural networks (A.N.N.),
which permit a more numeric type of learning. First studied separately, these
two approaches to learning have been combined in hybrid systems that integrate
the symbolic and the neural while trying to draw the best from each technique.

One of the problems was the expressive power of such systems. While classic AI
succeeded in implementing methods that permit the representation and treatment
of high-level knowledge, the same did not happen in connectionist AI, which
mainly permitted the representation of rules expressing relatively low-level
knowledge. Hybrid systems, in turn, were limited in their expressive power by
the capacities of their neural part.

Neuro-symbolic hybrid systems (NSHS) such as K.B.A.N.N. [18], SYNHESYS [7]
or INSS [9] generally use a set of symbolic rules of order 0 or 0+ of the type:

"IF <premise 1> AND/OR <premise 2>, THEN conclusion"

"IF attribute/variable (>, <, =) AND/OR value..., THEN conclusion" (1)

They do not permit the treatment of a type of knowledge employed by experts,
the knowledge called gradual:

"The MORE (or LESS) x is A, the MORE (or LESS) y is B" (2)

The main characteristics of rules of this type are the dynamic and gradual
relation between the premise and the conclusion, and the imprecision of the
knowledge treated. They have been much studied by linguists, who proposed to
represent graduality with the help of "topoi", relations that associate gradual
variables [11]. Logicians, for their part, used possibility theory and fuzzy
sets to formulate gradual rules equivalent to "topoi" ([2], [5]). Our approach
to the problem remains close to fuzzy logic.

With regard to the implementation, the starting point of our work was the
neuro-symbolic hybrid system INSS developed by Fernando Osorio [9]. INSS
permits the compilation, the learning and the explicitation of classic rules.
Our system is an extension of INSS, which we call INSS-Gradual (Gradual
Incremental Neuro-Symbolic System). It contains a connectionist module that
permits the learning of example bases containing gradual and modulation
relations, and a second module that permits the explicitation of gradual rules
from an artificial neural network. The connectionist module of INSS-Gradual is
provided with a new type of neural connection, the "Asymmetric Sigma-Pi" units,
derived from the "Sigma-Pi" units proposed by Rumelhart [13]. In the remainder
of this article, we describe the stages in the development of INSS-Gradual.



2 Characteristics and Classification of Gradual Rules 

Gradual knowledge (Davis [3], Despres [4]) describes a continuous, dynamic and
singular relationship among variables with an ordered domain: for example, the
influence or modulation of one variable on another. This knowledge only
indicates the direction of the change and not its magnitude. The relationship
is not symmetric: "x exerts influence on y" does not imply that "y exerts
influence on x". During our work, we identified two kinds of gradual rules:

■ Direct Gradual Rules (DGR) of the type "The MORE x is A, The MORE y is B".
They describe the progressive and direct influence of the premise on the
conclusion. For example: "An INCREASE of motor power results in An INCREASE in
fuel-consumption", or "↑ power ⇒ ↑ fuel-consumption" (where the symbols ↑ and ↓
indicate "an increase" or "a decrease", respectively).

■ Modulation Gradual Rules (MGR) of the type "The MORE x is A, The MORE
[DGR]". In these, the premise exerts a modulation (inhibition or excitation) on
the "force" of the DGR. For example: "The MORE the model of a car is old, The
MORE [An INCREASE of the car weight, resulting in A DECREASE in the
miles-per-gallon of fuel]", or "(+) model ↓ ⇒ [↑ weight ⇒ ↓ miles-per-gallon]".

In both cases, The MORE can be changed to The LESS, bearing in mind that the
relationships are not symmetrical and that changing MORE to LESS can imply a
different sense in the graduality relationship.



3 Modelling of the Neurobiological Phenomena of Modulation 

The concept of modulation present in gradual rules has foundations in
neurobiological phenomena. According to Smyth [14], one of the means of
integrating activity in the cerebral cortex of mammals is the modulation of the
transmission of information through "processors" present in the system. These
"processors" accentuate or inhibit the information they process depending on
its consistency with the information transmitted at the same time by the other
"processors". This process is a useful strategy because the information about a
variable of the environment can be transmitted through several other processing
lines. If different lines of information are concerned with different variables
of the environment, then several of these variables will be taken into account
statistically. These statistical relations (or correlations) can be used to
discover important variables of the environment, and especially their
reciprocal influences.

In his work on modelling the cortical column and analysing its operating and
learning laws, F. Alexander [1] identifies some "modulation" behaviours. He
shows that "the columns receive and can modulate information coming from
columns of other cortical areas or the same area, from other maxi-columns (set
of columns that receive more than a receptive field)".

This type of behaviour is the basis of some information transmission phenomena
that bring into play at least two input variables and an output variable, and
in which one input modulates the processing performed. The input in charge of
the modulation is called the "modulating variable" and the other input is
called the "modulated variable".



3.1 The Behaviour of Modulation in a Neurobiological Connection 

In classical neurobiological connections, a neuron fires its signal onto the
dendrite of another neuron. However, there exists another type of connection
corresponding to modulation phenomena: the pre-synaptic connections (see Fig.
1.a). In this type of connection, the modulating neuron discharges onto a
synapse and not onto a dendrite: through this synapse, a neuron controls
another synapse, having an inhibitory or excitatory effect on the latter.



3.2 The Sigma-Pi Units, a Way to Represent the Pre-synaptic Connections 

As previously said, the majority of neuro-symbolic hybrid systems only perform
the learning and/or the explicitation of symbolic rules of the type IF...
THEN... For this, they use A.N.N. based on the classic "Sigma" units (S units),
where each unit computes the weighted sum of the outputs of other units. In a
Sigma connection, the neural connection inhibits or excites a dendrite. This
type of unit does not take into account the modulation behaviour present in
gradual rules. This is why we were interested in the "Sigma-Pi" units (SP units
- see Fig. 1.b).

The "Sigma-Pi" units permit, for a particular unit, a multiplication of certain
inputs before they are passed to the Sigma unit. These multiplicative units
[13] allow an input to control the other unit or units. They are a way to
represent the neurobiological pre-synaptic connections, where the effectiveness
of the synapse between axon 1 and the dendrite would be modulated by the
activity between axon 1 and axon 2 [8]. However, the Sigma-Pi units do not take
into account the asymmetry that is present in pre-synaptic connections and in
the modulation behaviours of gradual rules: in the product of the variables x
and y we do not know whether x modulates y or the inverse. By multiplying the
two variables we lose information. For that reason we proposed the "Asymmetric
Sigma-Pi" units (ASP units - see Fig. 1.c). These units allow us to establish a
modulation relation between two inputs of the A.N.N. and, at the same time, to
show which input is the modulating one.

The "Asymmetric Sigma-Pi" units permit a “more natural” representation of the 
modulation relationships between the inputs of the A.N.N. without losing the 
corresponding information. These units were implemented in the neural module of 
INSS-Gradual. 
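
The loss of information in a plain Sigma-Pi unit can be illustrated with a
short sketch. It assumes a logistic activation and the usual Sigma-Pi form, a
weighted sum of products of selected inputs; the function name and argument
layout are hypothetical, and the ASP unit itself is not reproduced here because
its exact form is given only in the figure.

```python
import math

def sigma_pi_unit(inputs, conjuncts, weights, bias=0.0):
    """Sigma-Pi unit: weighted sum of products of selected inputs.

    conjuncts[k] is a tuple of input indices multiplied together and
    weighted by weights[k]. A logistic activation is assumed.
    """
    net = bias
    for indices, weight in zip(conjuncts, weights):
        product = 1.0
        for i in indices:
            product *= inputs[i]
        net += weight * product
    return 1.0 / (1.0 + math.exp(-net))

# Swapping the roles of x and y leaves the output unchanged, so the unit
# cannot tell which input modulates the other.
out_xy = sigma_pi_unit([0.2, 0.9], conjuncts=[(0, 1)], weights=[1.5])
out_yx = sigma_pi_unit([0.9, 0.2], conjuncts=[(0, 1)], weights=[1.5])
```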



4 Learning of Gradual Rules with Asymmetric Sigma-Pi Units 

The ASP units were implemented in the neural module of INSS-Gradual using the
"Cascade-Correlation" (CasCor) paradigm [6]: at each phase of learning, the
neural module proposes a set of ASP-left and ASP-right candidate units, which
are subjected to competitive learning using CasCor. Among them, the winning
unit is the one that maximizes the correlation between the residual error of
the global output and the output of the candidate unit. This winning unit is
incorporated into the net, and the learning of the base of examples continues.
In the end, we obtain an A.N.N. made up of S and ASP units (see Fig. 2).

Fig. 2. Neural network with a Sigma-Pi unit and an Asymmetric Sigma-Pi unit
created for the neural module of INSS-Gradual, used later on for the
explicitation.
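
The selection of the winning candidate can be sketched as follows. The score
is the usual Cascade-Correlation measure for a single output, the magnitude of
the covariance between a candidate's output and the residual error over the
training examples. The candidate names and numbers are illustrative
assumptions, and the training of the candidates themselves is omitted.

```python
def candidate_score(candidate_outputs, residual_errors):
    """Magnitude of the covariance between candidate output and residual error."""
    n = len(candidate_outputs)
    v_mean = sum(candidate_outputs) / n
    e_mean = sum(residual_errors) / n
    return abs(sum((v - v_mean) * (e - e_mean)
                   for v, e in zip(candidate_outputs, residual_errors)))

def pick_winner(candidates, residual_errors):
    """Return the name of the candidate unit with the highest score."""
    return max(candidates,
               key=lambda name: candidate_score(candidates[name],
                                                residual_errors))

residual = [0.5, -0.2, 0.1, -0.4]            # one value per training example
candidates = {
    "ASP-left": [0.9, 0.1, 0.6, 0.0],        # follows the residual closely
    "ASP-right": [0.5, 0.5, 0.4, 0.6],       # almost unrelated to it
}
winner = pick_winner(candidates, residual)   # "ASP-left"
```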

We tested the neural module on two bases of examples (in the first one, we knew
about the existence of gradual relationships; in the second one, we did not):

a) "Gradual Monk's Problem" (256 examples, 4 inputs, 1 output). This was
created from "Monk's Problem" [16]. The task is to classify the entities in the
gradual class robot (real value): {0, ..., 6}. For that we used the attributes
angles-that-form-the-head (ordered by the number of angles): {3, 4, 5, 6};
body-size: {0, 1, 2, 3}; smile (a bigger value indicates a more defined smile):
{1, 2, 3, 4}; and brain-size: {0, 1, 2, 3}. Table 1 presents the type of
examples used.

b) "Miles per Gallon Problem" (392 examples, 7 inputs, 1 output) [9]. The goal
is to predict the performance in "miles per gallon" (mpg) of diverse cars
starting from three qualitative attributes (cylinders-number, model, origin)
and four quantitative attributes (displacement, power, weight, acceleration).

After the learning process, the topologies of the neural networks are as
follows:

a) "Gradual Monk's Problem": three ASP units, of which two relate the
attributes angles-that-form-the-head and body-size, and one relates smile and
brain-size.

b) "Miles per Gallon Problem": four ASP units, of which two relate the
attributes model and weight, one relates origin with cylinders-number, and one
relates origin with acceleration. The type of neural network created is shown
in Fig. 2.



5 Eclectic Explicitation of Gradual Rules 

After the learning process, the INSS-Gradual system allows an explicitation of new 
rules starting from the A.N.N. We consider that the graduality is present in the direct 
connections, and that the ASP units represent the modulation relationships. Therefore, 
we built the DGR by means of the analysis of the direct connections and the MGR by 
the analysis of the derivative of the A.N.N.'s global output with regard to the inputs of 
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the ASP unit. For this, a new method of eclectic explicitation [17] was implemented. 

The method combines two stages: an analysis of the interior of A.N.N. 

(decompositional stage) and an analysis of the outputs of A.N.N. (pedagogical stage): 

1) Decompositional. The weights of the direct connections are examined (see Fig. 
2). For each direct connection, we calculate the percentage that its weight 
represents with regard to all the other connections. For the most important 
connections, we build a DGR depending on the weight sign. In addition, the ASP 
units are analysed to determine which of the inputs modulates and which 
one is modulated. 

2) Pedagogical. For each ASP unit, we calculate the derivative of the global output of 
the neural network with respect to the modulated input of the ASP 
relationship (∂a/∂x) (see Fig. 2). To build a MGR we analyse the sign of the 
derivative, the weight of the connection from the ASP unit to the output (w„), 
the weight of the connection from the P unit to the ASP unit (u„), the weight of 
the connection from the modulated input to the output (w2), as well as the 
topology of the ASP unit. The relationship between w2 and the output allows 
us to define the DGR (for example, "The MORE x, the MORE a" - see Fig. 2). 
The signs of the other weights and of the derivative (w„, u„, and ∂a/∂x) allow us 
to build the MGR completely (for example, "The MORE y, the MORE [DGR]" - 
see Fig. 2). 
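The two stages above can be illustrated with a minimal sketch. It is not the INSS-Gradual implementation: the function names, the importance threshold `min_share`, the toy network and the example weights are all assumptions made for illustration, and the derivative is estimated by central finite differences on a black-box network rather than by whatever analytical form the system uses.

```python
from typing import Callable, Dict, List, Sequence


def direct_gradual_rules(weights: Dict[str, float], output: str,
                         min_share: float = 0.15) -> List[str]:
    """Decompositional stage (sketch): keep the direct connections whose share
    of the total absolute weight reaches min_share (an assumed threshold) and
    read the direction of the rule from the sign of the weight."""
    total = sum(abs(w) for w in weights.values())
    rules: List[str] = []
    if total == 0:
        return rules
    for name, w in weights.items():
        if abs(w) / total >= min_share:
            trend = "MORE" if w > 0 else "LESS"
            rules.append(f"The MORE {name}, the {trend} {output}")
    return rules


def output_derivative(net: Callable[[Sequence[float]], float],
                      x: Sequence[float], index: int,
                      eps: float = 1e-5) -> float:
    """Pedagogical stage (sketch): central finite-difference estimate of the
    derivative of the global output with respect to input `index`."""
    hi, lo = list(x), list(x)
    hi[index] += eps
    lo[index] -= eps
    return (net(hi) - net(lo)) / (2 * eps)


def toy_net(v: Sequence[float]) -> float:
    """Hypothetical network: a direct link from x plus a term where y modulates x."""
    x, y = v
    return 0.5 * x + 0.3 * x * y


# Hypothetical direct-connection weights, for illustration only.
print(direct_gradual_rules({"weight": -0.8, "model": 0.6, "origin": 0.05}, "mpg"))
# Sign of the derivative of the output with respect to the modulated input x.
print(output_derivative(toy_net, [1.0, 2.0], index=0) > 0)
```

The sign of the estimated derivative, together with the signs of the weights, is what the pedagogical stage inspects; the sketch deliberately leaves out the analysis of the ASP unit topology.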



5.1 Results of the Explicitation by Using the Eclectic Method 

The results of the explicitation on the two bases of examples are as follows: 

a) "Gradual Monk's Problem". 

a.1) Direct Gradual Rules: 

DGR1: ↑ body-size => ↑ class-robot; 

DGR2: ↑ smile OR ↑ brain-size OR ↑ angles-that-form-the-head => class-robot. 

a.2) Modulation Gradual Rules: 

MGR1: (+) angles-that-form-the-head ↑ => (-) [↑ body-size => ↑ class-robot]; 
MGR2: (+) angles-that-form-the-head => (-) [↑ body-size => ↑ class-robot]; 
MGR3: (+) smile ↑ => (+) [↑ brain-size => ↓ class-robot]. 

The explicitation allows us to recover the two MGR with which the base of 
examples had been created. 

b) "Miles per Gallon". 

b.1) Direct Gradual Rules: 

DGR1: (↑ model OR ↑ acceleration OR ↑ displacement) => ↑ mpg; 

DGR2: (↑ weight OR ↑ power OR ↑ cylinders-number) => ↓ mpg. 

b.2) Modulation Gradual Rules: 

MGR1: (+) model ↑ => (-) [↑ weight => ↓ miles-per-gallon]; 

MGR2: (+) model ↑ => (-) [↑ weight => ↓ miles-per-gallon]; 

MGR3: (+) origin ↑ => (-) [↑ cylinders-number => ↓ mpg]; 

MGR4: (+) origin ↑ => (-) [↑ acceleration => ↑ mpg]. 

For the "Miles per Gallon" problem, we observe that some rules are similar to those 
obtained by Thimm [15], who used an "intuitive and informal" explicitation method. 
For example, he obtained "The MORE weight, the LESS mpg" and "However, the 
newer the cars are, the LESS they consume". These two rules are represented in 
MGR1: "The MORE recent the model of a car, the LESS [an INCREASE of the car 
weight results in a DECREASE in the miles-per-gallon of fuel]" or "(+) model 
↑ => (-) [↑ weight => ↓ miles-per-gallon]". One of the advantages of our method is 
the automation of this process and the possibility of obtaining both DGR and MGR. 



6 Conclusions 

Gradual rules permit the representation and treatment of gradual or 
modulation knowledge. During our research work, we identified the characteristics 
and the representation of these gradual rules. We proposed a classification of them 
into two kinds: direct gradual rules and modulation gradual rules. The first kind permits 
the representation of an immediate gradual link between the premise and the 
conclusion of a rule. The second kind corresponds to the case where a premise exerts 
a modulation effect on another, already established relation. 

In neurobiology, pre-synaptic synapses permit certain modulation effects that 
have consequences similar to the behaviours that we named gradual or 
modulation behaviours. Based on this neurobiological information, we proposed a 
new type of unit for neural networks, the Asymmetric Sigma-Pi units, which permit 
the representation of modulation processes and the learning of gradual knowledge. 
This led us to the INSS-Gradual system. To extract this knowledge, we added a 
method of explicitation of gradual rules that permits the extraction of this rule type 
from an A.N.N. that includes the Asymmetric Sigma-Pi units. 

The INSS-Gradual system permits a "fine" analysis of the mechanisms for the 
treatment of high-level knowledge and of the limits of these mechanisms in 
automatic knowledge acquisition systems (see the problem of representation of 
qualitative variables in [12]). The new connectionist units that it implements permit a 
re-balancing between the two learning methods combined in the classical hybrid 
system, while allowing both methods to take into account the same 
type of high-level knowledge. 



A Method to Obtain Sensing Requirements in Robotic Assemblies 

Abstract. This paper presents a method for determining sensing requirements 
for robotic assemblies from a geometrical analysis of critical 
contact-state transitions produced among mating parts during the exe- 
cution of nominal assembly plans. The goal is to support the reduction 
of real-life uncertainty through the recognition of assembly tasks that 
require force and visual feedback operations. The assembly tasks are de- 
composed into assembly skill primitives based on transitions described 
on a taxonomy of contact relations. Force feedback operations are de- 
scribed as a set of force compliance skills which are systematically asso- 
ciated to the assembly skill primitives. To determine the visual feedback 
operations and the type of visual information needed, a backward prop- 
agation process of geometrical constraints is used. This process defines 
new visual feedback requirements for the tasks from the discovery of di- 
rect, and indirect, insertion and contact dependencies among the mating 
parts. A computational implementation of the method was developed 
and validated with test cases containing assembly tasks that cover all the 
combinations of sensing requirements. The program behaved as expected 
in every case. 



1 Introduction 

This paper presents a method for determining sensing requirements for robotic 
assembly from an analysis of critical contact-state transitions produced among 
assembled parts during the execution of nominal assembly plans. The goal is to 
support the reduction of real-life uncertainty, through the recognition of assembly 
tasks that require force and visual feedback. Force feedback is proposed for 
the execution of assembly operations that involve objects in contact. 
Visual feedback is proposed before the assembly operations, to detect 
errors and deviations from the original plan that could require preventive 
adjustments in the configurations of the mating parts [1]. The plans considered are 
restricted to binary plans, which are also linear and sequential, that describe a 
totally ordered sequence of assembly steps. 

The information about contact formation is obtained by analyzing geometric 
models of the objects [2], and the contacts are classified according to a taxonomy 
used by Ikeuchi and Suehiro in the Assembly Plan from Observation (APO) 
method [3]. The assembly tasks are decomposed into assembly skill primitives 
based on transitions described in this taxonomy. 

The assembly parts considered in this work are rigid mechanical pieces modeled 
as polyhedral objects. Very small errors in assembly operations involving 
objects in contact could generate large forces that could damage the assembly 
parts, the robot, and/or the workcell. Such operations are identified as including 
critical contact transitions that require force control [4]. The force feedback 
operations prescribed by the method for these operations are described as a set 
of force compliance skills which are systematically associated with the assembly 
skill primitives. 

To determine the visual feedback operations and the type of visual informa- 
tion needed, a backward propagation process of geometrical constraints is used. 
This process defines new visual feedback requirements for the tasks from the dis- 
covery of direct, and indirect, insertion and contact dependencies among the mat- 
ing parts. The method extends the approach proposed by Miura and Ikeuchi [5] 
to determine preventive visual feedback requirements for the environmental ob- 
jects (stationary objects that configure the environment during an assembly 
operation) including cases of multiple motions of the same object, tasks that do 
not modify the contact state, and tasks that break previous contact formations. 

2 Contact State Analysis 

The use of sensors and sensing operations is needed only in tasks where the 
amount of uncertainty with respect to some dimensions is large enough to put 
their successful execution at risk. Such dimensions are what we call the critical 
dimensions of an assembly task. In this paper, the dimensions of an assembly 
operation are defined as the least number of independent coordinates required 
to specify the pose of the manipulated object and the poses of the objects in the 
environment that participate in contact relations with the manipulated object. 

Contacts are directional phenomena with a constraining effect on the motion of 
the manipulated object that depends on the contact’s direction and can be represented 
by N · ΔT ≥ 0, where N denotes the contact direction (constraint vector) 
and ΔT the possible translational motion vectors. The frictional resistance to 
motion generated among the contacting features is ignored in this analysis under 
the assumption that enough force is applied to overcome the existing resistance. 

To represent the contact relations formed during the execution of an assem- 
bly plan, we use the taxonomy shown in Fig. 1. This taxonomy identifies all 
possible assembly relations based on the directions of the contact surface nor- 
mals. The contact directions and possible movement directions are represented 
on the Gaussian sphere. 

The three digits in the labels of the states in Fig. 1 denote the number of 
maintaining DOF, detaching DOF, and constraining DOF, respectively. A maintaining 
DOF indicates that there is no constraint component in that direction, 
so a very small movement is not expected to modify the contact state. A 
detaching DOF indicates that a constraining component exists in that direction, 
so a conveniently selected motion can eliminate the contact. A constraining 
DOF indicates that there is no possibility of movement in that direction. 

Fig. 1. Contact-state relations taxonomy. 
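The three kinds of DOF can be illustrated with a small sketch that tests the contact constraint along one translational axis at a time. This is a simplification made for illustration (the analysis in this paper works with directions on the Gaussian sphere), and the function names and example normals are hypothetical.

```python
from typing import Sequence, Tuple

Vector = Tuple[float, float, float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def is_admissible(normals: Sequence[Vector], motion: Vector,
                  tol: float = 1e-9) -> bool:
    """A small translation is admissible when it has no component against any
    contact normal, i.e. N . dT >= 0 for every contact."""
    return all(dot(n, motion) >= -tol for n in normals)


def classify_dof(normals: Sequence[Vector], axis: Vector) -> str:
    """Classify one translational axis from the admissibility of +axis and -axis."""
    forward = is_admissible(normals, axis)
    backward = is_admissible(normals, (-axis[0], -axis[1], -axis[2]))
    if forward and backward:
        return "maintaining"   # no constraint component along this axis
    if forward or backward:
        return "detaching"     # one direction breaks the contact
    return "constraining"      # no motion possible along this axis


# Hypothetical example: a block resting on a table (normal along +z).
table = [(0.0, 0.0, 1.0)]
print(classify_dof(table, (1.0, 0.0, 0.0)))  # maintaining
print(classify_dof(table, (0.0, 0.0, 1.0)))  # detaching

# Hypothetical example: a part held between two opposing faces along x.
slot = [(1.0, 0.0, 0.0), (-1.0, 0.0, 0.0)]
print(classify_dof(slot, (1.0, 0.0, 0.0)))   # constraining
```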

The changes in the contact-state relations that characterize a task are identi- 
fied by the transitions of DOF in the manipulated object. There are six possible 
types of transitions between DOF: maintaining to detaching (M2D), maintain- 
ing to constraining (M2C), detaching to constraining (D2C), detaching to main- 
taining (D2M), constraining to maintaining (C2M), and constraining to detach- 
ing (C2D). It is also possible that an assembly plan includes some steps that 
do not modify the category of any DOF of the manipulated object. These steps, 
which include maintaining to maintaining (M2M), detaching to detaching (D2D), 
and constraining to constraining (C2C) pseudo-transitions of DOF, have to be 
considered since some of them could require sensing information. 



3 Assembly Skill Primitives 

Every assembly task comprises some kind of motion. From an analysis of the 
effect of this motion on the manipulated object’s DOF, the following four 
assembly skill primitives were extracted: move (M) - an assembly skill primitive 
required by tasks including M2M and D2M DOF transitions to displace 
an object in a completely unconstrained manner; make-contact (C) - a move 
assembly skill primitive required by tasks including M2D DOF transitions that 
ends when a new contact is produced between the manipulated object and the 
environment; insert (I) - an assembly skill primitive required by tasks including 
M2C transitions to move the manipulated object into a low-tolerance region 
where the completely unconstrained DOF ends up completely constrained; and 
slide (S) - an assembly skill primitive required by tasks including the rest of the 
DOF transitions - D2D, D2C, C2M, C2D, and C2C - to move an object while 
maintaining contact with at least one constraining surface (c-surface). 

Fig. 2. Procedure graph including the assembly skill primitives. 



Applying the DOF transition analysis to the procedure graphs shows 
that the execution of some assembly operations requires the concurrent use of 
multiple assembly skill primitives. Figure 2 depicts the required skills for the 
assembly operations in the procedure graphs. 
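The association between DOF transitions and assembly skill primitives described in this section can be written down as a lookup table. The sketch below is illustrative only: the names are hypothetical, and the per-dimension transitions in the example are assumed rather than taken from the procedure graphs.

```python
from typing import Dict

# Primitive required by each DOF transition, as listed in this section.
PRIMITIVE_FOR_TRANSITION: Dict[str, str] = {
    "M2M": "move",
    "D2M": "move",
    "M2D": "make-contact",
    "M2C": "insert",
    "D2D": "slide",
    "D2C": "slide",
    "C2M": "slide",
    "C2D": "slide",
    "C2C": "slide",
}


def primitives_for_step(transitions: Dict[str, str]) -> Dict[str, str]:
    """Map the DOF transition of each task dimension to its skill primitive.

    One step may need several primitives at once, one per dimension."""
    return {dof: PRIMITIVE_FOR_TRANSITION[t] for dof, t in transitions.items()}


# Assumed example: Tx becomes constrained while Ty and Tz stay free.
print(primitives_for_step({"Tx": "M2C", "Ty": "M2M", "Tz": "M2M"}))
```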



4 Force Sensing Strategies 

To implement the assembly skill primitives, force and torque information is used 
to detect new contacts and to react to tactile stimuli. All the skills, with the 
exception of the move assembly skill primitive, require force compliance 
capabilities. The move assembly skill primitive requires position control, but not 
necessarily any kind of sensing. 

From an analysis of the make-contact, slide, and insert assembly skill prim- 
itives three force compliance skills are recognized: detect-contact - a force 
compliance skill that moves an object until a new contact is produced against 
the assembly environment; configuration-control - a force compliance skill 
that corrects the configuration of the features in contact through rotations of 
the manipulated object; and keep-contact - a force compliance skill that moves 
an object while maintaining contact with constraining surfaces. 

Table 1 presents the force compliance skills needed to implement every as- 
sembly skill primitive. 
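The content of Table 1 can also be expressed as a small lookup, sketched below. The dictionary and function names are hypothetical; the skill assignments are those of Table 1, and the example reproduces the primitives of one step of the plan in Fig. 5 (slide on Tx, move on Ty, make-contact on Tz).

```python
from typing import Dict, Tuple

# Force compliance skills required by each assembly skill primitive (Table 1).
FORCE_SKILLS: Dict[str, Tuple[str, ...]] = {
    "move": (),
    "make-contact": ("detect-contact", "configuration-control"),
    "slide": ("keep-contact",),
    "insert": ("detect-contact", "keep-contact", "configuration-control"),
}


def force_sensing_plan(primitives: Dict[str, str]) -> Dict[str, Tuple[str, ...]]:
    """Return the force compliance skills per task dimension, omitting the
    dimensions whose primitive needs no force sensing."""
    return {dof: FORCE_SKILLS[p] for dof, p in primitives.items() if FORCE_SKILLS[p]}


print(force_sensing_plan({"Tx": "slide", "Ty": "move", "Tz": "make-contact"}))
```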



5 Visual Sensing Strategies 

The problem of determining the geometrical relations among objects in contact 
is inherently related to recognizing and locating objects in the scene. The typical 
way to detect contact relations among objects by vision is through a process of 
geometric reasoning and the use of threshold values. 

Table 1. Force compliance skills required by each assembly skill primitive. 

Assembly skill primitive    Force compliance skill 
------------------------    ---------------------- 
move                        (none) 
make-contact                detect-contact, configuration-control 
slide                       keep-contact 
insert                      detect-contact, keep-contact, configuration-control 


Since the present work is interested in using vision with preventive intentions, 
the sensing planner has to take advantage of the periods when the manipulator 
has control over the assembly parts to perform some convenient adjustments in 
their pose configurations. The best sensor configuration to perform a proposed 
visual sensing strategy should be obtained by a sensor planner [6] . 

The analysis to identify the assembly skill primitives that require preventive 
use of vision is divided into two parts: one that analyses an assembly operation 
in isolation from the rest of the plan, and another that analyses the role of an 
assembly operation as part of the full plan. In the first case, visual sensing 
requirements are determined considering only the new contact relations produced 
by the task when an object is manipulated. In the second case, visual sensing 
requirements are determined considering the contact relations produced by tasks 
where an object participates as part of the environment. 

5.1 Visual Sensing for a Manipulated Object 

Not all the assembly skill primitives require the use of preventive vision. Due to 
the assumption of bounded error, a move primitive is not expected to fail, and 
the cost of using vision is not considered worth its marginal benefit. Although 
vision could assist the make-contact primitive by discriminating 
among potential contact surfaces, determining the approach configuration, and 
evaluating subassembly stability risks, those situations are also considered of 
marginal benefit during assembly planning. In addition, vision is not used 
for the slide primitive either: vision is not good at detecting actual contacts, 
so it is not well suited to keeping contacts while moving. 

The insertion primitive is a low-tolerance operation where small errors typically 
cause failure. Insertion fails when the inserting features come into contact 
with faces adjacent to the inserting slot, instead of penetrating the target region. 
This failure can be recognized by monitoring the insertion process until a 
first contact is detected. Force and torque data can be used to determine success 
or failure. The absence of contacts before the insertion operations makes force 
feedback useless and vision adequate to verify the fulfillment of the alignment 
constraints imposed by the M2C DOF transitions. Consequently, an assembly 
step will require preventive vision on the manipulated object only if it 
includes any of the following five of the 36 possible transitions: S -> B, S -> E, 
A -> D, A -> E, or B -> E. 
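The rule above amounts to a membership test on the contact-state transition of a step. A minimal sketch, with hypothetical names and with the state labels written as the single letters used in the text:

```python
from typing import FrozenSet, Tuple

# The five contact-state transitions that call for preventive vision
# on the manipulated object, as listed in the text.
VISION_TRANSITIONS: FrozenSet[Tuple[str, str]] = frozenset({
    ("S", "B"), ("S", "E"), ("A", "D"), ("A", "E"), ("B", "E"),
})


def needs_preventive_vision(before: str, after: str) -> bool:
    """True when the step's contact-state transition is one of the five listed."""
    return (before, after) in VISION_TRANSITIONS


print(needs_preventive_vision("S", "B"))  # True
print(needs_preventive_vision("B", "D"))  # False
```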

5.2 Visual Sensing for an Environmental Object 

The sensing planner considers the use of vision in order to achieve two goals: 
first, to succeed in insertion operations because, as explained before, success 
depends not only on the positional control of the manipulated object but also 
on the existence of the environmental conditions for the insertion; second, to 
succeed in producing all the prescribed contacts among the assembly elements 
during the execution of every operation. 

Sensing for insertion condition. To succeed in insertions, the insert features 
of the environment have to correspond and be aligned with the insert features of 
the manipulated object. When the environment includes insert features of one 
object, there is no need to add new visual sensing requirements for it; only the 
manipulated object’s configuration would have to be adjusted to compensate 
for any observed error with respect to the environmental object’s pose. Instead, 
when the environment includes insert features of two or more objects, an indirect 
insert dependency is defined among them. If any of these environmental objects 
was pre-assembled, it is possible that new visual sensing requirements need to be 
included into its preventive visual sensing strategy. These sensing requirements 
must be fulfilled when this object is manipulated. 

Sensing for contact prescription. In the present work, the execution of 
an assembly operation is considered successful if all the expected contacts are 
achieved. To achieve all the expected contacts the configuration of the contacting 
features on the environment has to correspond with the configuration of the 
contacting features on the manipulated object. When the environment includes 
features of one object, there is no need to add new visual sensing requirements 
for it; only the manipulated object’s configuration would have to be adjusted, 
using force sensing, to compensate for any sensed error in the force and torque 
patterns. Instead, when the environment includes contact features of two or more 
objects, an indirect contact dependency is defined among them. If any of these 
environmental objects was pre-assembled, it is possible that new visual sensing 
requirements need to be included into its preventive visual sensing strategy. 
These sensing requirements must be fulfilled when this object is manipulated. 

5.3 The Insert and Contact Dependency Graph 

An insert and contact dependency graph (ICdg) is introduced to express direct 
and indirect dependencies among assembly elements caused by make-contact 
and insert assembly skill primitives. An ICdg is a graph where nodes represent 
objects and arcs represent alignment constraints extracted from an analysis of 
insert relations and contact relations resulting from operations described in a 
nominal assembly plan. The direction of the arcs follows the assembly order 
and thus describes a dependency of one object’s configuration (the arc’s source) 
on the configuration of another (the arc’s target). The root arrowed node is the 
currently manipulated object, while the end arrowed nodes are environmental 
objects. 

The ICdg is used, during the sensing planning stage, as a tool for determining 
the critical dimensions for the manipulation of objects. The ICdg is also used, 
during the sensing execution stage, for adjusting the observed configurations 
of the objects to conform with all the alignment constraints defined by past, 
current, and future contact-state relations. 




Fig. 3. ICdg. 



Elements of an ICdg 

An ICdg can include two types of nodes and three types of arcs. One type of 
node is added for each environmental object that is not moved during the full 
assembly process. A second type of node is added for each manipulated object. 
The objects related to nodes of the first type do not need to be observed because 
their poses are usually known from the start of the assembly, e.g. a work table. 
They can act as fixed constraining references for the other objects. When an 
assembly operation requires vision, the pose of some environmental 
objects related to the nodes of the second type would need to be observed. 

The three types of arcs were devised to record the nature of the dependency 
between the objects they link. Two types of arcs (solid in Fig. 3) record alignment 
constraints that describe direct relations originated from contacts and insertions 
produced during current and past manipulations of the objects involved. These 
arcs are generated between the manipulated object and one or more environmental 
objects. A third arc type (dashed in Fig. 3) records alignment constraints 
that describe both indirect insert dependencies and indirect contact dependencies 
among environmental objects. The labels of the arcs indicate the constrained 
DOF of the source node that can be described as a function of the DOF of the 
target node. Two nodes that are not linked by any arc hold no alignment 
dependency, which means that errors in the pose of one of the objects are 
not expected to affect the mating operation of the other. 

Propagation of Alignment Constraints 

New contact formations depicted as solid arcs in an ICdg could reveal new 
alignment constraints among environmental objects. Such environmental ob- 
jects could already participate in alignment constraints with other objects, which 
could reveal further constraints. This situation defines a backward propagation 
pattern that starts from a currently manipulated object up to pre-assembled 
environmental objects. 

This constraint propagation scheme involves at least three objects and is 
triggered by three conditions: (C1) an object is inserted or comes into contact 
with multiple environmental objects, making its configuration depend 
upon theirs, and the defined dependencies include common constrained DOF 
(joining constraint); (C2) an object is inserted or comes into contact with an 
environmental object, while this environmental object already depends on another 
environmental object, and in both cases the dependencies include common 
constrained DOF (inherited constraint); and (C3) an object is inserted or comes 
into contact with an environmental object, and this environmental object has 
another environmental object depending on it, and both dependencies on this 
object include common constrained DOF (shared constraint). 
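The three conditions can be sketched as pattern checks over the direct arcs of an ICdg. The code below is a minimal illustration under stated assumptions, not the implementation described in this paper: the class and function names are hypothetical, arcs are stored as (source, target) pairs labelled with constrained DOF, and the sketch only reports which condition fires for which triple of objects, without deciding which indirect arcs should then be added. The example arcs are invented for illustration and merely reuse part names from Fig. 5.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, Iterable, List, Set, Tuple

Trigger = Tuple[str, Tuple[str, str, str], FrozenSet[str]]


class ICdg:
    """Direct dependency arcs: source depends on target for the labelled DOF."""

    def __init__(self) -> None:
        self.arcs: Dict[Tuple[str, str], Set[str]] = defaultdict(set)

    def add_arc(self, source: str, target: str, dofs: Iterable[str]) -> None:
        self.arcs[(source, target)].update(dofs)

    def targets(self, source: str) -> Dict[str, Set[str]]:
        return {t: d for (s, t), d in self.arcs.items() if s == source}

    def sources(self, target: str) -> Dict[str, Set[str]]:
        return {s: d for (s, t), d in self.arcs.items() if t == target}


def propagation_triggers(graph: ICdg, manipulated: str) -> List[Trigger]:
    """Report which of the conditions C1 to C3 hold for the manipulated object."""
    found: List[Trigger] = []
    deps = graph.targets(manipulated)
    envs = sorted(deps)
    # C1, joining: two environmental objects constrain a common DOF.
    for i, first in enumerate(envs):
        for second in envs[i + 1:]:
            common = deps[first] & deps[second]
            if common:
                found.append(("joining", (manipulated, first, second), frozenset(common)))
    for env in envs:
        # C2, inherited: env already depends on another object for a common DOF.
        for other, dofs in sorted(graph.targets(env).items()):
            common = deps[env] & dofs
            if other != manipulated and common:
                found.append(("inherited", (manipulated, env, other), frozenset(common)))
        # C3, shared: another object already depends on env for a common DOF.
        for other, dofs in sorted(graph.sources(env).items()):
            common = deps[env] & dofs
            if other != manipulated and common:
                found.append(("shared", (manipulated, env, other), frozenset(common)))
    return found


# Hypothetical arcs, for illustration only.
graph = ICdg()
graph.add_arc("bar1", "body1", ["Tx"])
graph.add_arc("bar1", "body2", ["Tx"])
graph.add_arc("cover1", "bar1", ["Tx"])
print(propagation_triggers(graph, "bar1"))
print(propagation_triggers(graph, "cover1"))
```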

Not all the indirect dependencies are relevant for visual sensing. The indirect 
dependencies that are relevant are only those that relate with critical dimensions 
for a future assembly task, when these critical dimensions are not enforced by 
any direct dependency existing before the manipulation of the environmental 
objects was performed. 

Since an environmental object could be manipulated several times before 
the task that made an indirect dependency relevant, the constraint propagation 
method has to determine the manipulation step where vision will be useful for 
correcting any possible error with respect to the critical dimensions. That step 
is the most recent one performed on the environmental object in which the critical 
dimension was not constrained. Since the assembly skill primitives are defined 
for each task dimension, the manipulation step selected is the last step where 
the critical dimension required a move or an insert assembly skill primitive. 
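The selection rule in the last sentence can be sketched as a backward scan over the manipulation history of one object. The function name and the history format are hypothetical; the example history lists the primitives of steps 6 to 8 of the plan in Fig. 5, where the same part is manipulated three times.

```python
from typing import Dict, List, Optional, Tuple

# One record per manipulation of the object: (step number, primitive per dimension).
StepRecord = Tuple[int, Dict[str, str]]


def vision_step(history: List[StepRecord], dimension: str) -> Optional[int]:
    """Return the last step in which `dimension` required a move or an insert
    primitive, i.e. the last step where it was not yet constrained."""
    for step, primitives in reversed(history):
        if primitives.get(dimension) in ("move", "insert"):
            return step
    return None


# Primitives of steps 6 to 8 in Fig. 5.
cover1_history: List[StepRecord] = [
    (6, {"Tx": "move", "Ty": "move", "Tz": "move"}),
    (7, {"Tx": "insert", "Ty": "move", "Tz": "move"}),
    (8, {"Tx": "slide", "Ty": "move", "Tz": "make-contact"}),
]

print(vision_step(cover1_history, "Ty"))
print(vision_step(cover1_history, "Tx"))
```

For the Ty dimension the scan selects step 8, which is the step where Fig. 5 lists an alignment requirement on Ty.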

The constraint propagation method is applied for each assembly step. It starts 
by determining all the face contacts in which the manipulated object participates, 
and finishes when all the direct dependencies for the manipulated object and 
the new indirect dependencies for the environmental objects are deduced. Since 
the indirect dependencies are deduced for pre-assembled environmental objects, 
a record of the sensing plan evolution is maintained to allow going back to a 
manipulation step for an environmental object and modify its indirect dependencies. 

Fig. 4. Sequence of assembly steps. 



ASSEMBLY STEP: 1 
ASSEMBLY PART: body1 
Contact State Transition: S->S 
ASSEMBLY SKILL PRIMITIVES 
  Move Skill: Tx,Ty,Tz 

ASSEMBLY STEP: 2 
ASSEMBLY PART: body2 
Contact State Transition: S->S 
ASSEMBLY SKILL PRIMITIVES 
  Move Skill: Tx,Ty,Tz 
VISION SENSING 
  Alignment(body1: Tx,Tz) 

ASSEMBLY STEP: 3 
ASSEMBLY PART: bar1 
Contact State Transition: S->S 
ASSEMBLY SKILL PRIMITIVES 
  Move Skill: Tx,Ty,Tz 

ASSEMBLY STEP: 4 
ASSEMBLY PART: bar1 
Contact State Transition: S->B 
ASSEMBLY SKILL PRIMITIVES 
  Insertion Skill: Tx 
  Move Skill: Ty,Tz 
FORCE SENSING 
  Tx(body1,body2): dc-skill, kc-skill, cc-skill 
VISION SENSING 
  Insertion(body1, body2: Tx) 

ASSEMBLY STEP: 5 
ASSEMBLY PART: bar1 
Contact State Transition: B->D 
ASSEMBLY SKILL PRIMITIVES 
  Make-contact Skill: Tz 
  Move Skill: Ty 
  Slide Skill: Tx 
FORCE SENSING 
  Tx(body1,body2): kc-skill 
  Tz(body1,body2): dc-skill, cc-skill 

ASSEMBLY STEP: 6 
ASSEMBLY PART: cover1 
Contact State Transition: S->S 
ASSEMBLY SKILL PRIMITIVES 
  Move Skill: Tx,Ty,Tz 

ASSEMBLY STEP: 7 
ASSEMBLY PART: cover1 
Contact State Transition: S->B 
ASSEMBLY SKILL PRIMITIVES 
  Insertion Skill: Tx 
  Move Skill: Ty,Tz 
FORCE SENSING 
  Tx(bar1): dc-skill, kc-skill, cc-skill 
VISION SENSING 
  Insertion(bar1: Tx) 

ASSEMBLY STEP: 8 
ASSEMBLY PART: cover1 
Contact State Transition: B->D 
ASSEMBLY SKILL PRIMITIVES 
  Make-contact Skill: Tz 
  Move Skill: Ty 
  Slide Skill: Tx 
FORCE SENSING 
  Tx(bar1): kc-skill 
  Tz(body1,bar1): dc-skill, cc-skill 
VISION SENSING 
  Alignment(body1: Ty) 

ASSEMBLY STEP: 9 
ASSEMBLY PART: bolt1 
Contact State Transition: S->S 
ASSEMBLY SKILL PRIMITIVES 
  Move Skill: Tx,Ty,Tz 

ASSEMBLY STEP: 10 
ASSEMBLY PART: bolt1 
Contact State Transition: S->E 
ASSEMBLY SKILL PRIMITIVES 
  Insertion Skill: Tx,Ty 
  Move Skill: Tz 
FORCE SENSING 
  Tx(cover1): dc-skill, kc-skill, cc-skill 
  Ty(cover1): dc-skill, kc-skill, cc-skill 
VISION SENSING 
  Insertion(cover1: Tx,Ty) 

ASSEMBLY STEP: 11 
ASSEMBLY PART: bolt1 
Contact State Transition: E->E 
ASSEMBLY SKILL PRIMITIVES 
  Slide Skill: Tx,Ty 
  Move Skill: Tz 
FORCE SENSING 
  Tx(body1,cover1): kc-skill 
  Ty(body1,cover1): kc-skill 

ASSEMBLY STEP: 12 
ASSEMBLY PART: bolt1 
Contact State Transition: E->H 
ASSEMBLY SKILL PRIMITIVES 
  Slide Skill: Tx,Ty 
  Move Skill: Tz 
FORCE SENSING 
  Tx(body1,cover1): kc-skill 
  Ty(body1,cover1): kc-skill 
  Tz(cover1): dc-skill, cc-skill 

Fig. 5. Assembly plan with sensing operations. 

6 A Computational Implementation of the Method 

We implemented the proposed method as a C++ computer application. The 
CAD models of the objects are created with a CSG modeling tool known as 
VANTAGE [7]. 

A nominal assembly plan is composed of a sequence of assembly steps. Every 
step specifies the type of assembly operation, the name of an assembly part to 
be manipulated, the name of the VANTAGE object that represents the model of 
the part, and the motion parameters as a list of six values (three for translation 
and three for rotation). 
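The content of one plan step can be pictured as a small record. The sketch below is written in Python for brevity, whereas the application itself is in C++; the class name, field names and example values are hypothetical, and only the four pieces of information listed above come from the text.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class AssemblyStep:
    operation: str            # type of assembly operation
    part: str                 # name of the assembly part to be manipulated
    model: str                # name of the object that models the part
    motion: Tuple[float, ...]  # three translation values, then three rotation values

    def __post_init__(self) -> None:
        if len(self.motion) != 6:
            raise ValueError("motion must hold six values")

    @property
    def translation(self) -> Tuple[float, ...]:
        return self.motion[:3]

    @property
    def rotation(self) -> Tuple[float, ...]:
        return self.motion[3:]


# Hypothetical step, for illustration only.
step = AssemblyStep("insert", "bar1", "bar1_model", (10.0, 0.0, 0.0, 0.0, 0.0, 0.0))
print(step.translation, step.rotation)
```

A nominal plan is then simply an ordered list of such records, which matches the totally ordered, sequential plans assumed in the introduction.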

Next, we present a case solved using the proposed method. Fig. 4 illustrates 
the steps specified in the nominal assembly plan; Fig. 3 depicts the final ICdg 
for the case; and finally, Fig. 5 shows the resulting assembly plan including the 
sensing feedback operations. Force sensing is specified by relating force compliance 
skills with critical DOF and assembly parts. Visual sensing is specified as 
Insertion for steps that include that assembly skill primitive and as Alignment 
for steps that have indirect dependencies. 

7 Conclusions 

In this paper we have introduced a geometric reasoning method to determine 
force and vision sensing requirements for robotic assemblies. The characteriza- 
tion of assembly steps as transitions of contact states allowed the additional seg- 
mentation of tasks into assembly skill primitives. These primitives together with 
the polyhedral form of the assembly parts allowed the systematic association of 
tasks with force compliance skills that constitute the force sensing requirements for the 
execution of the assembly steps. Also, from the discovery of insert assembly skill 
primitives and make-contact skill primitives in which two or more environmental 
objects participate, new visual sensing requirements are deduced. 

We have implemented the method as a computer program, and tested and 
verified its effectiveness in determining all the expected sensing requirements. The 
method is limited to objects modeled as polyhedra and it is oriented to preventive 
sensing. Future research directions include its extension to objects with curved 
surfaces and the inclusion of sensing for verification and correction. 



Abstract. Today’s industrial robots use programming languages that 
do not allow learning and task knowledge acquisition, and this is probably 
one of the reasons for their restricted use for complex tasks in unstructured 
environments. In this paper, results on the implementation of a novel 
task planner using a 6 DOF industrial robot as an alternative to overcome 
this limitation are presented. Different Artificial Neural Network 
(ANN) models were assessed first to evaluate their learning capabilities, 
stability and feasibility of implementation in the planner. Simulations 
showed that the Adaptive Resonance Theory (ART) outperformed other 
connectionist models during tests and therefore this model was chosen. 
This work describes initial results on the implementation of the planner 
showing that the manipulator can acquire manipulative skills to assemble 
mechanical components using only a few clues. 



1 Introduction 

Time and cost in industrial production are the factors that have driven 
manufacturers of manipulators to improve the speed and 
precision of the systems they offer. Although the kinematics of manipulators 
has been extensively developed, their sensorial abilities are still poorly developed. 
Vision systems are common in quality control and inspection, whilst Force/Torque 
(F/T) sensors at the robot’s wrist are limited, with a few exceptions, to 
research. Sensorial ability seems to be absolutely necessary to provide more 
flexibility, efficiency, and a higher level of autonomy to industrial robots [1]. 

Contact force modeling is very complex within a three-dimensional environment 
subject to many uncertainties. Uncertainties in the production line can originate 
from the geometry and location of the assembly components, the position of the 
manipulated object with respect to the end effector, the stiffness matrix of the final 
compensation point, sensor noise, and unmodeled friction and flexibility [2]. 

Several alternatives have been proposed to solve the problems caused by such 
uncertainties [2]: 

1. To reduce uncertainties by improving the accuracy of the environment and 
of the robot. This is normally a very expensive process. 

2. To improve the design of the parts to be assembled in such a way that 
it simplifies the assembly process. This is not always possible and rarely 
enough. 

3. To apply active compensation techniques. The programmed path is modified 
by an algorithm that uses sensed contact forces as input. 

Two classes of active compensation can be distinguished: fine movement 
planning and reactive control. The latter is the foundation of the assembling 
methodologies to be employed as a result of the task planning. 

The next sections present related research, the planner, the neural 
networks tested, and the results. Finally, conclusions and comments are given. 



2 Background 

Several research efforts have addressed reactive control of manipulators for
the canonical peg-in-hole assembly [3, 4, 6, 7]. Due to the complex behavior
of contact forces, researchers have resorted to artificial neural networks
(ANN).

Taking the previous work of Lopez-Juarez [3] as a reference, our proposal
consists of the design of a higher-level controller, or task planner, which
receives as input the features of the components to be assembled (the task
descriptor) by means of a vision system. The planner is based on neural
networks, which allow it to decide which methodology the manipulator shall
employ when the pieces come into contact.

The assembly paths will be determined by the task planner. They depend on
the features of the assembly components (e.g. shape, size, presence of a
chamfer, and others). The information for the task descriptor comes from a
vision system and includes the location and orientation of the assembly
components needed to complete the assembly process.

3 Related Work 

A few researchers have applied neural networks to assembly operations with
manipulators and force feedback. Vijaykumar Gullapalli [7] used
Backpropagation (BP) and Reinforcement Learning (RL) to control a Zebra
robot. His neural controller was based on reducing the location error,
starting from a known location. Enric Cervera [6] employed a Self-Organizing
Map (SOM) and RL to control a Zebra robot, but the location of the
destination piece was unknown. Martin Howarth [4] used BP and RL to control
a SCARA robot, without knowing the assembly location. Howarth also proposed
task-level programming using a BP-based neural controller. It was not
implemented on a manipulator, but the simulation showed acceptable results
[5]. Ismael Lopez-Juarez [3] implemented Fuzzy ARTMAP to control a PUMA
robot, also with an unknown location. Jorg [1] presents the use of vision
systems and force feedback in the assembly of moving components.

Fig. 1. KUKA KR15 Manipulator Control Architecture

In neural controllers with force feedback, which form the lower level of the
planner in our proposal, the movements of the manipulator are constrained:
the maximum number is 12 (linear and angular movements along the axes of the
spatial coordinate system). When the F/T sensor indicates similar forces in
two directions, the controller chooses either direction arbitrarily; four
diagonal movements would ease this decision, giving a total of 16 movements.
We are currently working to include those movements in the assembly neural
controller.
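
The motion set described above can be sketched in Python. This is only an illustration, not the controller used in the experiments: the motion labels, the restriction of the four diagonals to the XY plane, and the `tie_ratio` threshold are assumptions introduced for the example.

```python
# Hypothetical labels: 12 axis-aligned motions plus 4 diagonals in the XY plane.
LINEAR = ["X+", "X-", "Y+", "Y-", "Z+", "Z-"]
ANGULAR = ["Rx+", "Rx-", "Ry+", "Ry-", "Rz+", "Rz-"]
DIAGONAL = ["X+Y+", "X+Y-", "X-Y+", "X-Y-"]
MOTIONS = LINEAR + ANGULAR + DIAGONAL


def opposite(axis, value):
    """Return the motion label that opposes a sensed force component."""
    return axis + ("-" if value > 0 else "+")


def select_motion(fx, fy, tie_ratio=0.9):
    """Pick a compensating motion in the XY plane from two sensed forces.

    tie_ratio is an assumed threshold: when the smaller magnitude is at
    least tie_ratio times the larger one, the forces count as similar and
    a diagonal motion replaces an arbitrary single-axis choice.
    """
    ax, ay = abs(fx), abs(fy)
    if ax == 0 and ay == 0:
        return None  # no contact force to compensate
    if min(ax, ay) >= tie_ratio * max(ax, ay):
        return opposite("X", fx) + opposite("Y", fy)
    return opposite("X", fx) if ax > ay else opposite("Y", fy)
```

With only the 12 axis-aligned motions, the case `select_motion(3.0, -3.0)` would have to fall back on an arbitrary axis; with the diagonals it maps to a single combined motion.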

In addition, taking into account the error compensation function that
mechanical devices provide during assembly, we propose reducing the
stiffness of the manipulator’s joints in order to diminish the contact
forces during assembly.

A common characteristic among the reviewed research efforts is the use of
neural networks for force feedback control. Many of these works demonstrated
that neural networks are adequate for this task, even though different kinds
of neural network architectures were involved. This motivated a new
experiment to observe the behavior of the best-known neural architectures.

4 Planner Design 

The main task of the planner is to perform mechanical assembly operations,
specifically peg-in-hole insertions, given that this is one of the most
studied problems. The planner will operate within an industrial environment.

4.1 Hardware Architecture 

The planner is designed for the assembly architecture shown in Fig. 1. It
consists of a 6-DOF manipulator, the robot controller, the F/T sensor mounted
at the robot’s wrist, the camera, and the master computer. The robot
controller contains the components that provide and control the power of the
robotic arm. The master computer communicates over serial links with the
controller and with the vision system. It receives and processes the F/T
signals and hosts the Task Planner and the assembly methodologies (Knowledge
Bases). When the planner is executed, the master computer sends low-level
commands to the robot controller. The camera is mounted over the
manipulator’s working area.

Fig. 2. Schematics of the Task Planner

4.2 Task-Level Planner

The idea of a planner arises from the nature and behavior of the contact
forces under different conditions of the parts to be assembled; e.g. the
assembly of chamfered parts shows a clearly different behavior from the
assembly of chamferless parts. Shape and size features can also produce
different contact force behavior during assembly. A planner that accounts
for these differences gives the robotic system greater flexibility to
assemble different pieces under different conditions.

Figure 2 shows the general schematics of the task planner. The indicated
tasks (chamfered assembly, chamferless assembly, and no assembly) are some of
the tasks that the robotic system must perform. Deciding which task to
perform is the fundamental work of the planner.

The task to perform depends on the physical features of the component and
the environment. A vision system is useful for automating the acquisition of
these features.

The test components were aluminum bars with three different cross-sections
(circular, square, and a composite of the two) and three different sizes
(24.8, 24.9, and 25.0 mm). The holed components (25 mm) were also made of
aluminum, in chamfered and chamferless versions.

The steps of the planner are:

1. Feature acquisition. Obtain the physical features of the components to be
assembled and their environment (location and orientation within the
manipulator’s working space). The acquisition of such features could be
automated by means of a vision system, in order to remove the need for a
rigid component fixation system.

2. Collection of the component to be assembled. Using the information
received from the vision system, the manipulator moves to the location of
the component and orients its gripper to pick the component up.

3. Placement of the component at the assembly location. The component is
taken to the assembly location with the aid of the vision system.

4. Planner’s input conversion. Every feature is encoded in an associated
vector to be supplied to the Task Planner.

5. Knowledge Base selection. The planner will provide the predicted knowledge
base to be employed to perform the assembly, as long as the assembly is
possible.

6. Execution of the assembly. The selected knowledge base is used by the
neural controller at the lowest level of the planner.
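
Step 4 can be illustrated with a small Python sketch. The source does not give the actual encoding, so the one-hot layout below is an assumption; only the feature values (three cross-sections, three sizes, chamfer or no chamfer) come from the test components described above.

```python
# Feature values taken from the test components; the one-hot layout is a
# hypothetical encoding chosen only for illustration.
SHAPES = ["circular", "square", "composite"]
SIZES_MM = [24.8, 24.9, 25.0]


def encode_features(shape, size_mm, chamfer):
    """Encode component features as a binary vector for the planner input.

    Layout (assumed): 3 bits for shape, 3 bits for size, 1 bit for chamfer.
    """
    shape_bits = [1 if shape == s else 0 for s in SHAPES]
    size_bits = [1 if abs(size_mm - s) < 1e-9 else 0 for s in SIZES_MM]
    if sum(shape_bits) != 1 or sum(size_bits) != 1:
        raise ValueError("unknown shape or size")
    return shape_bits + size_bits + [1 if chamfer else 0]
```

For example, a square 24.9 mm peg inserted into a chamfered hole would be encoded as `[0, 1, 0, 0, 1, 0, 1]` under this layout.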

4.3 Artificial Neural Networks

The use of Artificial Neural Networks in robotics is an alternative approach
for force feedback control. The type of connectionist network has to be
decided on the basis of network performance, so a prior analysis should be
carried out.

We assessed connectionist models with characteristics suitable for
implementation in the Task Planner. The chosen models were the Hopfield
network, Backpropagation, Adaptive Resonance Theory (ART), and
Self-Organizing Maps (SOM).

The Hopfield network is a single-layer, fixed-weight network [11].
Backpropagation is based on the multilayer Perceptron. Once the network is
trained, if new inputs become available, it has to be retrained with all the
old patterns together with the new ones. Training has to be done off-line,
and the number of epochs can easily reach hundreds or thousands. Adaptive
Resonance Theory is a model developed by Stephen Grossberg at Boston
University. It allows unsupervised as well as supervised learning [12]. The
Self-Organizing Map was developed by Teuvo Kohonen and consists of neurons
organized on a regular low-dimensional grid. Each neuron is represented by a
d-dimensional weight vector (prototype vector, codebook vector), where d is
equal to the dimension of the input vectors. The neurons are connected to
adjacent neurons by a neighborhood relation, which dictates the topology, or
structure, of the map [10].
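
The SOM structure just described (a grid of d-dimensional prototype vectors linked by a neighborhood relation) can be sketched as follows. This is a minimal illustration and not the toolbox used in the experiments; it assumes a rectangular grid with a square neighborhood, whereas the map reported later is hexagonal, and the function names are hypothetical.

```python
import math


def distance(a, b):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def best_matching_unit(grid, x):
    """grid maps (row, col) -> prototype vector; return the closest position."""
    return min(grid, key=lambda pos: distance(grid[pos], x))


def quantization_error(grid, inputs):
    """Mean distance between each input and its best matching prototype."""
    total = sum(distance(grid[best_matching_unit(grid, x)], x) for x in inputs)
    return total / len(inputs)


def train_step(grid, x, rate=0.5, radius=1):
    """Move the winner and its grid neighbors toward the input x."""
    bmu = best_matching_unit(grid, x)
    for pos, w in grid.items():
        # Square neighborhood on a rectangular grid (an assumption).
        if max(abs(pos[0] - bmu[0]), abs(pos[1] - bmu[1])) <= radius:
            grid[pos] = [wi + rate * (xi - wi) for wi, xi in zip(w, x)]
    return bmu
```

Coarse and fine learning phases of the kind reported in the simulation results would correspond to calling `train_step` first with a large `radius` and `rate`, then with smaller ones.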

4.4 Simulation Results 

Simulation results on the performance of the different ANNs were obtained by
implementing the above network algorithms in MATLAB, running on a Pentium
III PC at 650 MHz. In all cases the basic algorithm was employed, except for
the SOM, where a toolbox was also used.

Fig. 3. Example of input representation

Ten binary patterns like the one shown in Figure 3 were used. These patterns
represented the characters for the digits 0-9 using a [15 x 10] matrix, and
the same test was run for all the networks.

Hopfield Network. This network was limited and unable to classify more than
two patterns. It took 0.16 s to train and test the learning of the two
patterns. Its convergence time (learning and testing) was also measured for
comparison with the other networks: for 100,000 epochs, the time was
298.18 s.
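
For readers unfamiliar with the model, a fixed-weight Hopfield network can be sketched in a few lines of Python. This is a generic Hebbian formulation with synchronous updates, not the MATLAB code used in the experiments; the two stored patterns are artificial 150-element vectors chosen to be nearly orthogonal, not the digit patterns of Fig. 3.

```python
def train_hopfield(patterns):
    """Build the fixed weight matrix with the Hebbian rule (zero diagonal)."""
    n = len(patterns[0])
    w = [[0] * n for _ in range(n)]
    for p in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:
                    w[i][j] += p[i] * p[j]
    return w


def recall(w, state, max_steps=10):
    """Update all units synchronously until the state stops changing."""
    state = list(state)
    for _ in range(max_steps):
        new = [1 if sum(wij * sj for wij, sj in zip(row, state)) >= 0 else -1
               for row in w]
        if new == state:
            break
        state = new
    return state


# Two artificial bipolar patterns of length 150 (same size as a 15 x 10 image).
p1 = [1] * 75 + [-1] * 75
p2 = [1 if i % 2 == 0 else -1 for i in range(150)]
weights = train_hopfield([p1, p2])

# Corrupt the first five elements of p1 and let the network restore them.
noisy = [-v for v in p1[:5]] + p1[5:]
restored = recall(weights, noisy)
```

Strongly correlated patterns, such as images of digits, interfere with one another in the weight matrix, which is consistent with the limited capacity observed above.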

Backpropagation. Different topologies were used: 150-300-150, 150-300-10,
and 150-75-10 for the input, hidden, and output layers respectively. For the
150-300-10 topology, learning took 500 epochs and 2200 s (36 min 40 s). The
chosen topology was 150-40-10, which after 2200 epochs showed satisfactory
results with a maximum error of 15%. In the testing phase, the network
responded in 1.56 s.

ARTMAP. The network learned very quickly. Using a learning rate of 0.9, the
network learned all patterns in two epochs in 0.23 s, even when noise of
0.1% to 0.3% was added.

Self-Organizing Map (SOM). The network was tested on the proposed input
patterns using the toolbox from the Helsinki University. The response time
was 1.75 s. The simulation used 5 epochs of coarse learning and 18 epochs of
fine learning, with a final quantization error of 0.396.

ART-1. The network learned all the patterns in two epochs. With this
architecture the response time was 9.38 s, which included the display of the
patterns. In a later test of this network programmed in C, the learning time
was very short: about 6 ms for processing alone, without the display of the
patterns.

These results are summarized in Table 1.

Table 1. ANN Performance

FEATURES              HOPFIELD                    B.P.                ARTMAP      ART1          SOM
No. Epochs            100,000                     2200                2           2             5, 18
Convergence time (s)  298.18 (does not converge)  216 per 100 epochs  0.4         9.373         1.572
Topology              Single layer [150, 150]     150-40-10           -           -             150-[13,5], hexagonal
Learning              Fixed                       Supervised          Supervised  Unsupervised  Unsupervised


5 Implementation Results 

The formation of the initial knowledge in the robot consists of showing the robot 
how to react to individual components of the Force/Torque vector at the wrist 
of the manipulator. The influence of each vector component requires a motion 
opposite to the direction of the applied force to diminish its effect. The procedure 
is illustrated in Figure 4. 
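
The formation of the initial knowledge can be sketched as a list of training pairs, one per signed component of the F/T vector, each labeled with the opposing motion. The component names, motion labels, and unit magnitude are assumptions for illustration; the actual training patterns are not given in the text.

```python
# Hypothetical names for the six F/T components and their motion axes.
COMPONENTS = ["Fx", "Fy", "Fz", "Mx", "My", "Mz"]
MOTION_AXIS = {"Fx": "X", "Fy": "Y", "Fz": "Z",
               "Mx": "Rx", "My": "Ry", "Mz": "Rz"}


def initial_knowledge(magnitude=1.0):
    """Return (F/T vector, motion label) pairs for the initial knowledge base.

    Each pair isolates one signed component and associates it with the
    motion opposite to the applied force or torque.
    """
    pairs = []
    for i, comp in enumerate(COMPONENTS):
        for sign in (+1, -1):
            vector = [0.0] * len(COMPONENTS)
            vector[i] = sign * magnitude
            motion = MOTION_AXIS[comp] + ("-" if sign > 0 else "+")
            pairs.append((vector, motion))
    return pairs
```

This yields 12 pairs, matching the 12 linear and angular movements available to the controller.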




Fig. 4. Training Procedure 

Fig. 5. Insertion with learning enabled (circular chamfered peg insertion;
offset in mm: x = -2.5, y = -2.5; horizontal axis: steps)



The intelligent assembly was carried out using aluminum pegs with different
cross-sectional geometries. During the experiments, the Fuzzy ARTMAP network
parameters were set for fast learning (learning rate = 1). The base
vigilance $\bar{\rho}_a$ was given a low value, since it is incremented
during internal operations. $\rho_{map}$ and $\rho_b$ were set much higher
to make the network more selective, creating as many clusters as possible.
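
The effect of the vigilance parameter can be seen in a sketch of a single Fuzzy ART module with fast learning. This is not the full Fuzzy ARTMAP used in the experiments (there is no map field and no match tracking); it only shows the standard choice function, the vigilance test, and the fast-learning update. The choice parameter `alpha` and the function names are assumptions.

```python
def complement_code(a):
    """Append the complement of each feature, as is usual in Fuzzy ART."""
    return list(a) + [1.0 - x for x in a]


def fuzzy_and(x, y):
    """Component-wise minimum of two vectors."""
    return [min(xi, yi) for xi, yi in zip(x, y)]


def norm(x):
    """City-block norm (inputs are non-negative)."""
    return sum(x)


def present(categories, a, rho, alpha=0.001):
    """Present input a; return the index of the category that learns it.

    Categories are searched in order of the choice function. The first one
    that passes the vigilance test |I ^ w| / |I| >= rho is updated with
    fast learning (w := I ^ w). If none passes, a new category is created.
    """
    i = complement_code(a)
    order = sorted(
        range(len(categories)),
        key=lambda j: norm(fuzzy_and(i, categories[j])) / (alpha + norm(categories[j])),
        reverse=True,
    )
    for j in order:
        if norm(fuzzy_and(i, categories[j])) / norm(i) >= rho:
            categories[j] = fuzzy_and(i, categories[j])
            return j
    categories.append(i)
    return len(categories) - 1
```

With `rho` close to 1, dissimilar inputs fail the vigilance test and open new categories, which is the selective, many-cluster behavior described above; with `rho` near 0, they are absorbed into existing categories.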

Figure 5 shows data collected during an insertion directed by the Task
Planner with learning enabled, meaning that the Planner was allowed to learn
new patterns. For comparison, another insertion with the same offset was
carried out with the robot’s learning capability inhibited (see Figure 6).
In this case the robot uses solely the initial learning, and no patterns are
allowed to be learned during the operation. In both figures, the upper graph
shows the force traces, whereas the motion directions commanded by the Task
Planner are given in the lower graph. In the Motion Direction graph, the
horizontal axis corresponds to the Z-direction. Bars above the horizontal
axis represent linear alignments, and bars below it represent angular
alignments.

Although the offset was the same, the number of alignment motions and the
insertion time were higher with learning inhibited. In that case the robot
was not allowed to learn contact states within the chamfer, hence the Task
Planner generated motions based only on its initial knowledge. The robot was
ultimately able to insert the workpiece; however, the performance was poorer
in terms of alignment and consequently speed.

Fig. 6. Insertion with learning inhibited



6 Conclusions and Comments 

In this paper an intelligent task-level planner for robotic assembly was
proposed. After deciding that a connectionist network could offer the
appropriate features for learning and recognition, a performance analysis of
different candidate models was carried out. Our results showed that, in
terms of speed, both ART and SOM networks have suitable characteristics for
use in the task planner. ART was chosen mainly because of its fast learning
response (typically 1 to 2 epochs) and its robustness against noise. Initial
results from the implementation on a robot demonstrate the usefulness of our
approach. The planner has been tested at the lower level, and further work
is needed. In future work, the planner should acquire its initial knowledge
on-line; that is, the knowledge base should be constructed autonomously.
With this feature, the robot could perform operations autonomously without
prior knowledge of the task. The robot would then be instructed with simple
orders, e.g. insert, thus achieving true task-level programming.

Acknowledgments. The authors thank the Consejo Estatal de Ciencia y
Tecnologia (CONCYTEQ) and the Deutscher Akademischer Austausch Dienst (DAAD)
for partially funding this project.

Abstract. Nowadays, there is a plethora of robotic systems from
different vendors and with different characteristics that work on specific
tasks. Unfortunately, most robotic operating systems come with a
closed control architecture. This makes it challenging to integrate
these systems with other robotic components, such as vision systems
or other types of robots. In this paper, we propose an integration
methodology to create or enhance robotic systems by combining tools
from computer vision, planning systems, and distributed computing.
In particular, we propose the use of the CORBA specification
to create Wrapper Components. These are object-oriented modules
that create an abstract interface for a specific class of hardware or
software components. Furthermore, they have several connectivity and
communication properties that make them easy to interconnect.

Keywords: CORBA, Distributed Components, Robotics. 



1 Introduction 

Nowadays there are many types of robotic systems, and each one is designed with 
a specific task in mind. However, there are some tasks that can be improved by 
using new tools. Integration with more sensory systems is necessary; also, better 
algorithms that increase the dexterity of the robot are needed. This is a challenge, 
and a time-consuming task for the robot designer, mainly because of the closed 
controller architecture in the robotic components. 

Distributed Robotic Systems (DRS) is a very wide and multi-disciplinary
research area. It is expected that DRS will be able to perform tasks that are
impossible for a single robot, e.g. moving a large load, and will be more
reliable and efficient. The next generation of robots must deal with a wide
range of complex and uncertain situations. Thus, robot control systems must
provide the resources for intelligent performance. Isolated resources have
appeared in the modern robotics literature; however, there are few
frameworks for combining them coherently into an integrated system [1].
Because of the large number of components, many problems arise when
designing a DRS. Specifically, the following problems are the most often
considered:

— Construction of the robotic components (hardware implementation). 

— Communication among intelligent robotic components. 

— Reconfiguration of the system to deal with changing environments. 

— Methods of distributing intelligence and control among the components. 

This paper addresses the communication and integration problem. We propose
the use of modular components based on the standard middleware specification
Common Object Request Broker Architecture (CORBA), formulated by the Object
Management Group (OMG) [2]. This specification defines a middleware
component that alleviates the hard task of communicating objects across
different platforms, operating systems, and programming languages. The main
idea is to hide the internal details of the robotic components while they
offer a standard interface according to the class of component to which they
belong. We call our approach Wrapper Components.

2 Related Work 

There are several approaches to creating solutions for distributed
intelligent robotics. One of the first was the Task Control Architecture
(TCA) [3], created at CMU. TCA provides a general control framework for
building task-level control systems for mobile robots, and a high-level,
machine-independent method for passing messages between distributed
machines. Its main drawback is the centralized nature of this communication
scheme. CAMPOUT [4], which stands for Control Architecture for Multi-robot
Planetary OUTposts, offers a set of key mechanisms and architectural
components to facilitate the development of multi-robot systems. The
mechanisms are Modular Task Decomposition, Behavior Coordination Mechanisms,
Group Coordination, and Communication Infrastructure. The communication
facilities are provided using UNIX-style sockets, a low-level communication
protocol compared to our approach using CORBA. TelRIP [5] is a protocol that
allows the building of modular tele-robotics networks. It uses the
producer/consumer approach to deliver data objects. The main contribution of
TelRIP is the capability for measuring and monitoring the communication
performance, but it is based on data management, whereas our approach
follows the object-oriented paradigm.

Sanz et al. used CORBA as the middleware for their multi-mobile-robot
research project called ARCO, Architecture for COoperation of mobile
platforms [6]. Integration is one of the biggest problems to tackle in the
development of large and complex systems using artificial components [7].
Like us, ARCO addressed the integration problem by using modular, generic,
flexible, and compatible components. However, ARCO is more oriented to the
control issue, whereas we are oriented to the construction of robotic
applications.

Fig. 1. (a) Basic elements in a Wrapper Component. (b) Two Robot Controller
Wrapper Components of type A accessed through the same type of interface in
the Graphic User Interface module.

3 Wrapper Components 

We adopted some ideas from the aforementioned works to construct modular
DRS. We defined the Wrapper Components using the Interface Definition
Language (IDL) provided in the CORBA specification.

3.1 Wrapper Component Definition 

As the name suggests, a wrapper component is a programming module that
encapsulates an abstract functionality for specific hardware or software
modules in the system. The basic configuration consists of three main parts:
the IDL interface, the Transformation/Interpreter Code, and the
Hardware/Software Object/Library Implementation. Fig. 1(a) shows these
parts.

IDL Interface. This is an interface definition for a particular class of
components; its actual definition is a key issue for constructing reusable
and easy-to-connect components. We define three basic functionalities for
this interface: Abstraction, Monitoring, and Configuration. By Abstraction
we refer to the particular set of functions (methods) that a specific
component class must have, without taking implementation details into
account. For example, in Fig. 1(b) there are two robot controllers which
belong to the same class; however, their hardware/software components are
different. Code modules C1 and C2 differ because of the different
implementations B1 and B2, but both have the same IDL A definition, so
client users of this component can use either without changing their code.
By Monitoring we refer to the general set of functions that every component
must have to query its internal states. By Configuration we refer to the
capability of changing the internal range of the component according to
external requirements. To illustrate the latter two concepts we provide a
simple gripper example. Suppose we have a robot with a servo-gripper whose
range goes from -10 volts when the gripper is totally closed to +10 volts
when the gripper is totally open. In the design of the system, the range for
the gripper was set to [0, 100] for an aperture of 10 cm, from totally
closed (0 cm) to totally open (10 cm). If the robot is grasping an object
whose width is 7.2 cm, then the response to a query of the gripper’s current
position must be 72, instead of +4.4 volts.
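
The arithmetic of the gripper example can be checked with a short Python sketch. The ranges are those given in the example (volts from -10 to +10, interface units from 0 to 100, aperture 10 cm); the function names are hypothetical and the linear mapping is an assumption consistent with the numbers quoted.

```python
# Ranges taken from the gripper example in the text.
V_MIN, V_MAX = -10.0, 10.0     # servo signal, closed to open (volts)
U_MIN, U_MAX = 0.0, 100.0      # configured interface range
APERTURE_CM = 10.0             # physical opening at U_MAX


def volts_to_units(volts):
    """Map the raw servo voltage to the configured interface range."""
    return U_MIN + (volts - V_MIN) * (U_MAX - U_MIN) / (V_MAX - V_MIN)


def units_to_volts(units):
    """Inverse mapping, used when a client commands a position."""
    return V_MIN + (units - U_MIN) * (V_MAX - V_MIN) / (U_MAX - U_MIN)


def width_cm_to_units(width_cm):
    """Express an object width in interface units."""
    return U_MIN + width_cm / APERTURE_CM * (U_MAX - U_MIN)
```

An object 7.2 cm wide corresponds to 72 interface units, which in turn corresponds to (72 / 100) x 20 - 10 = +4.4 volts at the servo. Changing `U_MIN` and `U_MAX` is the kind of reconfiguration that the Configuration functionality refers to.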

Transformation/Interpreter Code. This is the element that requires the most
effort when implementing the wrapper component. The component builder has to
do all the data transformation and data interpretation. These steps are
required to match the data types and data structures of the IDL interface to
those of the Hardware/Software object implementation, and vice versa.

Hardware/Software Object Implementation. This is the element that must be
integrated into the whole system. Usually, it is defined for specific
hardware or is supplied by a specific vendor. In most cases, this component
is provided “as is” with an Application Program Interface (API) definition.
However, it is difficult, if not impossible, to access the low-level code.

3.2 Wrapper Component Features 

We found the following conventional properties: Reusability, Connectivity,
Generality, and Flexibility. However, we are also interested in Abstraction
and Manipulation.

By using the CORBA specification we achieved connectivity in at least three
aspects: a) platform, i.e. the actual computational architecture, b)
operating systems, and c) programming languages. Abstraction and reusability
are two related properties: while reusability aims at interchangeable
modules, abstraction helps to create them. The more abstract a module is,
the more general its definition. On the other hand, if we can manipulate a
component on the fly, then we can achieve a certain degree of flexibility.
Fig. 2 shows the conceptual scheme of using Wrapper Components to create
application modules. The bold line represents the concept of a special
interface among the components so that they can connect to each other
easily.



3.3 Design Guidelines for Creating a Wrapper Component 

There is no general technique that can solve every problem that arises
during the development stage; however, some aspects must be considered
before starting the development of any module. Some guidelines conflict with
each other, so a tradeoff must be made according to the requirements of the
application or the flexibility of the system.

Fig. 2. Conceptual scheme to create a Module using Wrapper Components
(example module: Mobile Robot, with its IDL interface).

The smaller the number of functions, the easier to connect. Defining a small
but powerful set of functions in the interfaces between the different
components gives a better chance of good connectivity; however, achieving
good communication through such an interface requires time-consuming effort.
Consider the example shown below.

interface Robot {
    void do_action(in string action);
    void get_status(out string status);
};

In this example the interface is text-oriented in both directions and has no
other functions. Indeed, we can construct a complete interface using these
two functions, but the effort required is considerable.
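
The cost of such a text-oriented interface can be seen in a small Python sketch of a component that implements it. The class, the command vocabulary (`MOVE`, `STOP`), and the status strings are all hypothetical; the `out` parameter of `get_status` is modeled here simply as a return value.

```python
class TextRobot:
    """Toy component exposing only do_action and get_status."""

    def __init__(self):
        self._pose = [0.0, 0.0, 0.0]
        self._status = "IDLE"

    def do_action(self, action):
        # Every command must be parsed and validated by hand.
        parts = action.split()
        if len(parts) == 4 and parts[0] == "MOVE":
            try:
                self._pose = [float(p) for p in parts[1:]]
            except ValueError:
                self._status = "ERROR bad number"
                return
            self._status = "OK"
        elif parts == ["STOP"]:
            self._status = "IDLE"
        else:
            self._status = "ERROR unknown action"

    def get_status(self):
        # The client must parse this string in turn.
        return self._status
```

The interface itself stays trivially small, but both sides now share an informal command language whose parsing and error handling have to be written and kept consistent manually.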

Data Oriented versus Message Oriented. Usually, in control and robotic
systems, if data transmission is heavy in both size and frequency, we prefer
a data-oriented approach to a message-oriented one.

Manipulation and Configuration. The Wrapper Components must allow a certain
level of manipulation or reconfiguration in order to be flexible. This is
accomplished by means of configuration functions defined in the IDL
interface.

Abstract function, a user-oriented approach. By an abstract function, we
refer to the way the component is conceived. For example, an arm manipulator
can be viewed as a set of axes where each axis has a set of properties, such
as minimum/maximum range of movement, type of movement (rotational vs.
translational), etc. This is a generic approach that applies to almost all
types of arm manipulators, but the effort to build an application is passed
to the next level of design. On the other hand, an arm manipulator can be
viewed as a piece of equipment that performs a set of functions, such as
extend/retract the arm, turn the wrist or arm base, move up/down, etc. This
is also a generic approach, where the major effort is in the development of
the component but the complexity at the next higher level is reduced.

3.4 CORBA Services in Wrapper Components

CORBA Services play an important role when designing Wrapper Components. 
In this research we are exploiting two services: Naming Service [8] and Event 
Service [9]. 

Naming Service: Every time a component starts its execution, it registers
with a defined Naming Server under a specific name. This is a standard
feature for all components. If another component appears and needs to
communicate with the previous component, the only action required is to
“resolve” the name.

Event Service: This is an advanced communication scheme. Usually, it is used
in a publisher/subscriber scheme to communicate from 1 to n anonymous
components, and also for sending data asynchronously.
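
The roles of the two services can be illustrated with an in-process Python stand-in. This is not CORBA code and uses no ORB: `NamingRegistry` and `EventChannel` are hypothetical classes that only mimic the register/resolve and publish/subscribe patterns, and delivery here is synchronous, unlike a real event channel.

```python
class NamingRegistry:
    """Stand-in for a naming server: components register under a name."""

    def __init__(self):
        self._objects = {}

    def bind(self, name, obj):
        self._objects[name] = obj

    def resolve(self, name):
        return self._objects[name]


class EventChannel:
    """Stand-in for an event channel: one supplier, n anonymous consumers."""

    def __init__(self):
        self._consumers = []

    def connect(self, callback):
        self._consumers.append(callback)

    def push(self, event):
        # The supplier does not know who, or how many, will receive the event.
        for callback in self._consumers:
            callback(event)


# A component registers its channel at start-up ...
registry = NamingRegistry()
registry.bind("VisionSystem/Responses", EventChannel())

# ... and any later component only needs the name to reach it.
received_a, received_b = [], []
channel = registry.resolve("VisionSystem/Responses")
channel.connect(received_a.append)
channel.connect(received_b.append)
channel.push('(FIND "cup" OK (12,34))')
```

The supplier pushes one event and both consumers receive it without the supplier holding any reference to them, which is the 1-to-n anonymity described above.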

4 Application 

We developed several wrapper components for our robotic system. Due to
limited space, we present only part of the Vision System, which has
representative features. It is a very important component for improving the
performance of the robotic units. However, its integration has traditionally
been accomplished as a module tied to the robot software bundle, mainly to
improve the response time of the whole system. Our equipment has two stereo
cameras mounted on similar pan-tilt units. See [10] for more details; Fig. 4
shows a partial view of this setup.

4.1 Vision System Abstraction 

We want some characteristics for this system without focusing on particular
hardware. For example, it is desirable that the vision system captures
images periodically. At the high level, the vision system is expected to
carry out an object recognition process or a complex task, such as object
tracking. At the middle level, it could carry out a specific image
processing task, such as edge detection or blob-motion detection. Based on
these features we define the following points for the Vision System:

IDL definition. The following interface definition assumes certain
properties of the low-level control, such as a maximum frame rate available
from the framegrabber. The functions defined are biased toward manipulating
the vision system at a high level.

interface vision {
    void Learn(in short x, in short y, in string name);
    void Find(in string name);
    void Track(in string name);
    void End_Track();
    void GetObj();
};

Learn is an abstract function that invokes a learning process in the vision
system. In our current implementation, this task is based on the feature
extraction of a solid object whose approximate center position (x, y) is
passed as an argument; e.g. a user can select the object by clicking inside
it in an image displayed through a graphic user interface. Because Learn is
an abstract function, the way it is implemented can vary from one vision
system to another. For example, one system can use the external shape of the
object to identify some invariant properties, another can use points of
interest and the correlation between them, while yet another can identify
correlations between specific parts of the object; but all implementations
must have a method to save the name of the object with the features
associated with it. The function Find looks for the object passed in the
argument name. First, Find searches the object database to check whether an
object with that name has been learned. If the object exists in the
database, the next step is a search in the current image captured by the
framegrabber. If the object is found, an approximate center position is
returned. Another implementation could return the top-left position of the
object with an accuracy metric. GetObj is a function that returns a list of
previously learned objects. Track and End_Track are explained in the next
section.



Communication Channels. From the previous IDL definition it can be ob- 
served that none of the functions return a value, neither as an argument or as 
a type returned value. This is decided in this way because of the functionality 
required from the vision system and to take advantage of the CORBA services. 
Thus, the response of the system is captured through other communication chan- 
nels based on Event Channels or Event Service. Fig. 3 depicts this idea. The 
Event Channel for images is used to send compressed image frame data. This 
information is sent periodically based on the velocity of the framegrabber. For 
example, from Fig. 3 the graphic user interface (GUI) is connected at any time 
to the Image Event Channel to receive images and to display them in a screen 
monitor. Next the user can select one object from this interface as an object to 
be learned by the vision system. The Response Event Channel is used to send a 




Fig. 3. Communication channels for the Vision System

Table 1. Command Response Format for Vision Feedback.

Command      Response
LEARN        (LEARN "Tag_name" OK)
             (LEARN "Tag_name" FAIL)
FIND         (FIND "Tag_name" OK (x,y))
             (FIND "Tag_name" FAIL)
TRACK        (TRACK number_of_objects ("Tag_name_1" OK (x,y,z)) ...
             ("Tag_name_m" FAIL))
END_TRACK    Stops all messages of the tracking procedure.
GET_OBJ      (GET number_of_objects ("Tag_name_1", ..., "Tag_name_m"))



feedback of the central position of certain objects. This activity is started using
the function Track. The response message uses a LISP-like format to send a
variable number of tracked objects. Table 1 shows part of the message format
for the commands mentioned in this article. The message transmission is ended
by the command End_Track.
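
As an illustration of how a consumer on the Response Event Channel might decode these messages, the following Python sketch parses LISP-like response strings into nested lists. The tokenizer, the function names and the treatment of coordinates as a parenthesized, comma-separated group are assumptions made for this example; the paper does not specify a parser.

```python
import re

# Tokens: parentheses, double-quoted tag names, or bare atoms (commas act as separators).
TOKEN = re.compile(r'\(|\)|"[^"]*"|[^\s(),"]+')

def parse_response(message):
    """Parse one LISP-like response string into nested Python lists."""
    tokens = TOKEN.findall(message)
    pos = 0

    def atom(tok):
        if tok.startswith('"'):
            return tok[1:-1]          # tag name without the quotes
        try:
            return float(tok)         # coordinates and object counts
        except ValueError:
            return tok                # keywords such as FIND, OK, FAIL

    def read():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            items = []
            while tokens[pos] != ")":
                items.append(read())
            pos += 1                  # skip the closing parenthesis
            return items
        return atom(tok)

    return read()

def tracked_positions(message):
    """Map tag name -> (x, y, z) for every object reported OK in a TRACK message."""
    parsed = parse_response(message)
    if parsed[0] != "TRACK":
        raise ValueError("not a TRACK response")
    return {entry[0]: tuple(entry[2]) for entry in parsed[2:] if entry[1] == "OK"}
```

A tracking client would call `tracked_positions` on each message received until it issues End_Track; objects reported as FAIL are simply absent from the returned dictionary.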



4.2 Experimental Tests 

In our current configuration we used the vision system component together with
the robot controller component to move the arm manipulator so that it is able
to grasp a specific object. This object was previously learned by the system.
Fig. 4 shows the setup for this experiment. There is a "robot's brain controller",
as shown in Fig. 2. This component interacts with the user or operator to
receive the next action or activity to execute. When it receives the order to grasp
an object, it performs the following sequence of activities.

1. Ask the vision system for the list of learned objects. If the object exists, go
to the next step; if it does not, respond to the user indicating
that the object name is not recognized.

2. Send the command Find object. This ensures that the previously learned object
can be found in the current image. If the object is found, proceed to the next
step. Otherwise, reply with an object-not-found message.

3. Send a Track command for two objects: End-Effector and the object to grasp.

4. Enter a conditional loop that moves the arm manipulator according to the
position error between the End-Effector and the object. The position error is
calculated with the information received through the response event channel.

5. When the error is below a specific threshold, the Track procedure is ended
and an End_Track command is issued. Next, the brain controller sends the
close_gripper command to the robot controller.
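
The five steps can be sketched as a small Python loop. This is only an illustration under stated assumptions: `FakeVision` is a hypothetical stand-in for the vision wrapper component, responses are returned synchronously instead of arriving through event channels, and the arm motion is simulated by moving the End-Effector a fraction (`gain`) of the position error.

```python
import math

class FakeVision:
    """Hypothetical stand-in for the vision wrapper component (no CORBA involved)."""
    def __init__(self, objects):
        self.objects = dict(objects)      # tag name -> (x, y, z)
        self.tracking = False

    def get_obj(self):
        return list(self.objects)

    def find(self, name):
        return name in self.objects

    def track(self, names):
        self.tracking = True
        return {n: self.objects[n] for n in names}

    def end_track(self):
        self.tracking = False

def grasp(vision, target, threshold=0.01, gain=0.5, max_steps=1000):
    """Steps 1-5: check the name, find, track, servo on the error, close the gripper."""
    if target not in vision.get_obj():                          # step 1
        return "object name not recognized"
    if not vision.find(target):                                 # step 2
        return "object not found"
    for _ in range(max_steps):
        pos = vision.track(["End-Effector", target])            # step 3
        error = [t - e for e, t in zip(pos["End-Effector"], pos[target])]
        if math.sqrt(sum(d * d for d in error)) < threshold:    # step 5
            vision.end_track()
            return "close_gripper"
        # step 4: move the arm by a fraction of the error (simulated robot motion)
        vision.objects["End-Effector"] = tuple(
            e + gain * d for e, d in zip(pos["End-Effector"], error))
    vision.end_track()
    return "timeout"
```

The `timeout` branch is one possible alternative path for the case in which the error never falls below the threshold.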

This sequence of activities follows a natural path similar to what a human
being would do. Also, our design follows a non-blocking communication scheme
by using the producer/consumer approach. This is a very important aspect because
the components are distributed. So, if one of the components fails or is
missing, the component's designer has several alternative paths to conclude the
task.

Fig. 4. Experimental setup for the integration of wrapper components. Left, a
stereo camera takes visual information about the arm manipulator and the objects to
identify and to grasp.

Next we enumerate some comparisons between the distributed non-abstract
(DNA) approach and the distributed abstract (DA) approach.

1. Modularity. Both approaches offer modularity, but DA offers better
performance at the high level of design.

2. Development time. For small projects DNA offers a quick response, but for
long projects DA evolves better.

3. Reusability. In this case DNA could suffer from small changes, but in DA
the impact of a change is isolated.

4. Maintenance. DNA works fine for small projects but it will suffer in large
projects.

5 Conclusions 

In order to provide an effective solution for the Integration and Communication
problems when designing a Distributed Robotic System (DRS), we proposed the
use of Commercial-Off-The-Shelf (COTS) middleware components to construct
a communication framework for designing DRS. We named these elements
Wrapper Components, because of the properties they must have to hide
the internal details of their particular implementations while offering a set of
communication facilities to connect the components into the whole system.

We presented a Vision System as a Wrapper Component to illustrate the
communication and integration aspects.



Abstract. In this paper, a fuzzy sliding mode controller based on genetic
algorithms is designed to govern the dynamics of rigid robot manipulators.
When a fuzzy sliding mode controller is designed, there is no criterion for reaching an
optimal design. Therefore, we formulate the design of a fuzzy sliding mode controller for
general nonlinear control systems as an optimization problem and apply
optimal searching algorithms, namely genetic algorithms, to find the optimal rules
and membership functions of the controller. The proposed approach has the
merit of determining the optimal structure and the inference rules of the fuzzy sliding
mode controller simultaneously. Using the proposed approach, the tracking
problem of a two-degree-of-freedom rigid robot manipulator is studied.
Simulation results of the closed-loop system with the proposed controller based
on genetic algorithms show its effectiveness.



1 Introduction 

Several well-known difficulties must be overcome in Fuzzy Logic Control (FLC)
design: (1) Converting the experts' know-how into if-then rules is
difficult, and the results are often incomplete or contain unnecessary rules, since operators and
control engineers cannot specify every detail or cannot express all their
knowledge, including intuition and inspiration. (2) The characteristics of fuzzy control
systems cannot be pre-specified. (3) It is hard to search for the optimal parameters of the
controller to achieve maximum performance. To overcome (1) and (2),
fuzzy sliding mode (FSM) control schemes [1,2] have been proposed. Essentially, FSM control
design can be considered as a multi-parameter optimization problem, which eases
difficulty (3) [3-5]. To improve the performance of the fuzzy controller based on sliding
mode control (SMC) [6-12], we adopt optimal searching algorithms, that is, genetic
algorithms (GAs). GAs have been demonstrated to be a powerful tool for
automating the definition of the fuzzy control rule base and membership functions,
because adaptive control, learning, and self-organization can in many cases be
considered as optimization or searching processes. These advantages have led to the use of
GAs in a wide range of approaches for designing fuzzy sliding mode
controllers over the last few years. This work presents a Fuzzy Sliding Mode Based
on Genetic Algorithms (FSMBGA) control design applied to the tracking problem of a
two-degree-of-freedom rigid robot manipulator. The simulation results show that the
FSMBGA controller exhibits a better and faster response than the FSM
control action.

2 Fuzzy Sliding Mode Control Design Based on Genetic
Algorithms

2.1 Sliding Mode Control Design 

A Sliding Mode Controller is a Variable Structure Controller (VSC). Basically, a 
VSC includes several different continuous functions that map plant state to a control 
surface, and the switching among different functions is determined by plant state that 
is represented by a switching function. 

Consider the design of a sliding mode controller for the following system

$$\dot{z}(t) = A\,(z(t) - z_d) + B\,u(t) + f(z, u, t) \qquad (1)$$

where $z_d$ is the reference trajectory and $u(t)$ is the input to the system. We choose $m$
switching functions as follows

$$s_i(z) = c_i z = c_{i1} z_1 + c_{i2} z_2 + \ldots + c_{in} z_n \qquad (2)$$

where $c_i = [c_{i1}, \ldots, c_{in}]$ is a sliding vector and $n$ is the number of states. We
rewrite Equation (2) in the form

$$s(z) = c\,z \qquad (3)$$

where $c = [c_1, \ldots, c_m]^T$.

The following is a possible choice of the structure of a sliding mode controller [13]

$$u = u_{eq} + u_h \qquad (4)$$

where

$$u_{eq} = -(cB)^{-1} cAz, \qquad u_h = -(cB)^{-1}(\Gamma + \sigma)\,\mathrm{sgn}(s) \qquad (5)$$

The control strategy adopted here guarantees that the system trajectory moves toward
and stays on the sliding surface $s = 0$ from any initial condition if the following
condition is met

$$s\,\dot{s} < -\sigma |s| \qquad (6)$$

where $\sigma$ is a positive constant that guarantees that the system trajectories hit the sliding
surface in finite time [13].

It is proven that if $k$ is large enough, the sliding mode controllers of (2) are
guaranteed to be asymptotically stable [13].
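
To make the reaching condition $s\dot{s} < -\sigma|s|$ concrete, the following Python sketch simulates a scalar switching variable with dynamics $\dot{s} = d(t) + u_h$, a bounded disturbance $|d| \le D$ and the hitting control $u_h = -(D + \sigma)\,\mathrm{sgn}(s)$. The scalar system, the sinusoidal disturbance and all numerical values are assumptions chosen for illustration; they are not the manipulator model of this paper. Under these assumptions $|s|$ decreases at a rate of at least $\sigma$, so the surface should be reached no later than $|s(0)|/\sigma$.

```python
import math

def sign(x):
    return (x > 0) - (x < 0)

def simulate_reaching(s0, sigma=0.5, d_max=1.0, dt=1e-3, t_end=5.0):
    """Euler simulation of ds/dt = d(t) - (d_max + sigma) * sign(s).

    Returns the first time at which |s| falls inside the chattering band of
    this discretisation, dt * (2 * d_max + sigma), or None if it never does.
    """
    s, t = s0, 0.0
    band = dt * (2 * d_max + sigma)
    while t < t_end:
        if abs(s) <= band:
            return t
        d = d_max * math.sin(3.0 * t)                  # bounded disturbance, |d| <= d_max
        s_dot = d - (d_max + sigma) * sign(s)
        assert s * s_dot <= -sigma * abs(s) + 1e-12    # the reaching condition holds
        s += dt * s_dot
        t += dt
    return None
```

With `s0 = 1.0` and `sigma = 0.5` the returned hitting time stays below the bound of 2 time units; inside the band the discrete controller chatters, which is the effect the fuzzy boundary layer of the next section is meant to soften.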



2.2 Fuzzy Sliding Mode Control Design 

Fuzzy control rules are the core of fuzzy control design. However, when there are
more than two fuzzy variables, establishing a complete fuzzy rule set becomes difficult.
The SMC guarantees the stability and robustness of the resulting control system,
which can be achieved systematically but at the cost of the chattering effect. The FSM
controller is a hybrid controller that combines the advantages of the fuzzy
controller and the sliding mode controller. The combination is a feasible
approach to rectifying the shortcomings and preserving the advantages of these two
approaches. The structure of the fuzzy sliding mode controller is described as follows.
According to the control law (4) with one switching function, we have the fuzzy
control rule $j$ [3] as

$R^j$: If $s$ is $A^j$, then $u$ is $u_j$.

where $j = -q, -q+1, \ldots, q$, $s$ is obtained from Equation (3) for one switching function,
and $A^j$ is a linguistic value with respect to $s$ of rule $j$. The definition of the membership
function is






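$$\mu_{A^j}(s) = \begin{cases} \dfrac{s - \sigma_{j-1}}{\sigma_j - \sigma_{j-1}}, & \sigma_{j-1} \le s \le \sigma_j \\[2mm] \dfrac{\sigma_{j+1} - s}{\sigma_{j+1} - \sigma_j}, & \sigma_j < s \le \sigma_{j+1} \end{cases} \qquad (7)$$

where $\sigma_j$ is the centre of the $j$th membership function. The triangular membership function
is determined by three parameters, $\sigma_{j-1}$, $\sigma_j$ and $\sigma_{j+1}$. The definition of the membership
functions is symmetrical, that is, $\sigma_0 = 0$, $\sigma_{-1} = -\sigma_1$, ..., $\sigma_{-q} = -\sigma_q$. The control law
$u_j$ is

$$u_j = k_j\,\mathrm{sgn}(s_j) + u_{eq_j} \qquad (8)$$

where

$$u_{eq_j} = G_j z \qquad (9)$$

in which $G_j = [g_{j1}, g_{j2}, \ldots, g_{jn}]$ and $j = -q, -q+1, \ldots, q$. With respect to the SMC, the
parameters are assumed as follows:

$$q = 1, \qquad \sigma_0 = 0, \qquad -\sigma_{-1} = \sigma_1 = \varepsilon \ \text{with} \ \varepsilon \to 0$$
$$G_{-1} = G_0 = G_1 = cA \qquad (10)$$
$$k_{-1} = (cB)^{-1}(\Gamma + \sigma), \qquad k_0 = 0, \qquad k_1 = -(cB)^{-1}(\Gamma + \sigma)$$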



2.3 Fuzzy Sliding Mode Control Based on Genetic Algorithms 

In the previous studies [3-7], the structures and parameters of the control rules determine the
performance of fuzzy control. From the control point of view, the parameters of the
structures should be modified automatically by evaluating the results of fuzzy control.
In this section, we introduce the GAs to the problem of determining and
optimizing the FSM control for a given system. The key to putting a genetic search for the
FSM control into practice is that all design variables to be optimized are encoded as a
finite-length string. Each design is represented by a binary string, which consists of
various smaller strings that can be decoded to the value of each design variable.
According to the structure and parameters of the FSM controller in the previous section,
the individual multivariable binary coding is arranged in the following form

$$[\ \ldots \mid \sigma_1 \mid \sigma_2 \mid \ldots \mid k \mid \ldots\ ]$$

where $\sigma_j$, $j = 1, 2, \ldots, q$, are parameters of the membership functions in the antecedent fuzzy sets
as shown in Equation (7), and $k_j$ and $G_j$ are parameters of the consequent part as shown in
Equations (8), (9). The binary string of this FSMC has $1 + 4n + 3q + 4nq$ variables.

Fitness as a qualitative attribute measures the reproductive efficiency of living 
creatures according to the principle of survival of the fittest. In the FSM controller 
design, the parameters of controller are determined and optimized through assessing 
the individual fitness. In order to employ the GAs to optimize the FSM controller for 
the system, we establish the fitness function according to the objective of active 
vibration control. Thus the FSMC design based on the GAs can be considered as an 
optimization search procedure over a large parameter space. For the active vibration 
control, we define the performance index [4] as 

$$J = \frac{1}{M} \sum_{k=1}^{M} |s_k| \qquad (11)$$

where $s_k$ is the value of the switching function $s$ at the $k$th time step, $M = \mathrm{int}(t_{max}/\Delta t)$
denotes the number of computing steps, $t_{max}$ is the running time, and $\Delta t$ is the
sampling period. The fitness function can then be defined as

$$F = \frac{1}{J + \tau} \qquad (12)$$

where $\tau$ is a small positive constant used to avoid the numerical error of dividing by
zero.
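
As a minimal Python sketch of this fitness evaluation, assume that the performance index $J$ is the mean of $|s_k|$ over the $M$ sampled values of the switching function and that the fitness is $F = 1/(J + \tau)$. The function names and the default value of $\tau$ are illustrative choices, not taken from the paper.

```python
def performance_index(s_samples):
    """J = (1/M) * sum(|s_k|) over the M sampled values of the switching function."""
    m = len(s_samples)
    return sum(abs(s) for s in s_samples) / m

def fitness(s_samples, tau=1e-6):
    """F = 1 / (J + tau); tau keeps F finite when J is zero."""
    return 1.0 / (performance_index(s_samples) + tau)
```

A controller that keeps the trajectory closer to the sliding surface yields a smaller $J$ and therefore a larger fitness, which is what the selection step of the GA rewards.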



The GAs control parameters play an important role in the procedure of optimizing 
the parameters of the fuzzy logic controller. Some worthwhile discussions of the GAs 
parameters are made as follows: 

• Encoding form: The linear encoding form is used. The length of binary 
coding string for each variable is important for the GAs. There is always a 
compromise between complexity and accuracy in the choice of string length. 
Here, a 16-bit binary coding is used for each parameter. 



• Crossover and mutation rates: Crossover and mutation rates are not fixed
during the evolution period. At the beginning, the crossover and mutation rates are
set to 0.9 and 0.1, respectively, and then decrease by 10 percent in each generation
until the crossover rate is 0.5 and the mutation rate is 0.01.

• Population size: The population size has to be an even number and is kept 
fixed throughout. Generally, the bigger the population size, the more design 
features are included. The population size should not be too small, but the 
procedure of optimizing will be slow when the population size is big. 
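
The encoding and the rate schedule can be sketched in Python as follows. Two points are assumptions of this example rather than statements of the paper: each 16-bit gene is mapped linearly onto its search interval, and "decrease 10 percent in each generation" is read as multiplication by 0.9 per generation with the stated final values acting as floors.

```python
BITS = 16  # bit length per parameter, as stated in the text

def decode(bits, low, high):
    """Map a 16-character '0'/'1' string linearly onto the interval [low, high]."""
    if len(bits) != BITS or set(bits) - {"0", "1"}:
        raise ValueError("expected a 16-bit binary string")
    return low + (high - low) * int(bits, 2) / (2 ** BITS - 1)

def decode_chromosome(chromosome, ranges):
    """Split a chromosome into 16-bit genes and decode each with its own range."""
    genes = [chromosome[i:i + BITS] for i in range(0, len(chromosome), BITS)]
    return [decode(g, lo, hi) for g, (lo, hi) in zip(genes, ranges)]

def rates(generation, pc0=0.9, pm0=0.1, pc_min=0.5, pm_min=0.01, decay=0.9):
    """Crossover and mutation rates after `generation` generations."""
    pc = max(pc_min, pc0 * decay ** generation)
    pm = max(pm_min, pm0 * decay ** generation)
    return pc, pm
```

With 16 bits the resolution of a parameter searched over an interval of width 15 is about 15/65535, which illustrates the compromise between string length and accuracy mentioned above.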



3 Two-Degree-of-Freedom Robot Manipulator Model 

We begin with a general analysis of an n-joint rigid robotic manipulator system whose 
dynamics may be described by the second-order nonlinear vector differential equation 
[14-17] 



$$M(q)\,\ddot{q} + h(q, \dot{q}) = u(t) \qquad (13)$$

where $q(t)$ is the $n \times 1$ vector of joint angular positions, $M(q)$ is the $n \times n$ symmetric
positive definite inertia matrix, $h(q, \dot{q})$ is the $n \times 1$ vector containing Coriolis and
centrifugal forces and gravity torques, and $u(t)$ is the $n \times 1$ vector of applied joint torques
(control inputs).

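The dynamic equations of the two-link robotic manipulator are expressed in state
variable form with $x_1 = q_1$, $x_2 = \dot{q}_1$, $x_3 = q_2$, $x_4 = \dot{q}_2$ and $x = [x_1\ x_2\ x_3\ x_4]^T$. The
dynamics of this specific system are given by the equations

$$\dot{x}_1 = x_2 \qquad \text{(14-a)}$$
$$\dot{x}_2 = \frac{1}{a_{11}a_{22} - a_{12}^2}\left[a_{22}\left(b\,x_4(2x_2 + x_4) + \gamma_1 g + u_1\right) - a_{12}\left(\gamma_2 g - b\,x_2^2 + u_2\right)\right] \qquad \text{(14-b)}$$
$$\dot{x}_3 = x_4 \qquad \text{(14-c)}$$
$$\dot{x}_4 = \frac{1}{a_{11}a_{22} - a_{12}^2}\left[a_{11}\left(\gamma_2 g - b\,x_2^2 + u_2\right) - a_{12}\left(b\,x_4(2x_2 + x_4) + \gamma_1 g + u_1\right)\right] \qquad \text{(14-d)}$$

where

$$a_{11} = (m_1 + m_2) r_1^2 + m_2 r_2^2 + 2 m_2 r_1 r_2 \cos(x_3) + J_1 \qquad \text{(15-a)}$$
$$a_{12} = m_2 r_2^2 + m_2 r_1 r_2 \cos(x_3) \qquad \text{(15-b)}$$
$$b = m_2 r_1 r_2 \sin(x_3) \qquad \text{(15-c)}$$
$$\gamma_1 = -\left((m_1 + m_2) r_1 \cos(x_1) + m_2 r_2 \cos(x_1 + x_3)\right) \qquad \text{(15-d)}$$
$$\gamma_2 = -m_2 r_2 \cos(x_1 + x_3) \qquad \text{(15-e)}$$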

4 Simulation Results 

To assess the FSMC based on GAs, simulation results of a two-degree-of-freedom
robot manipulator with the proposed control action are obtained. For the simulation, the
following parameters are considered:

$$r_1 = 1.0\ \mathrm{m}, \quad r_2 = 0.8\ \mathrm{m}, \quad J_1 = 5\ \mathrm{kg\,m^2}, \quad J_2 = 5\ \mathrm{kg\,m^2}$$
$$m_1 = 0.5\ \mathrm{kg}, \quad m_2 = 1.5\ \mathrm{kg}, \quad g = 9.8\ \mathrm{m/s^2}$$
$$p_1 = 50, \quad p_2 = 50, \quad p_3 = 1, \quad p_4 = 1$$

Using the proposed method, the nonlinear optimal control law for $u_1$ and $u_2$ can be
obtained. In this section the MATLAB simulation highlights the operation of the
manipulator when tracking an oscillatory reference signal. Here the
desired trajectory reference signals are defined as

$$x_{1d}(t) = \begin{cases} -0.75 \sin(\pi t/20), & 0 \le t \le 10 \\ -0.75, & t > 10 \end{cases}$$
$$x_{3d}(t) = \begin{cases} 2\pi \sin(\pi t/20), & 0 \le t \le 10 \\ 2\pi, & t > 10 \end{cases}$$

The initial state values of the system are selected as
$[x_1\ x_2\ x_3\ x_4]^T = [-0.15\ \ {-0.2}\ \ 0.349\ \ 0.98]^T$



4.1 SMC Design 

In the first stage, we choose one switching function as

$s = cz = c_1 z_1 + \ldots + c_4 z_4$, in which $c = [1\ 1\ 1\ 1]$,

$u_{eq} = [2.3955\ \ {-8.4554}\ \ 3.0288\ \ 5.595]\,z$

$u_h = 25\,\mathrm{sign}(s)$



4.2 FSMC Design

In this case we choose the same switching function as in the SMC case.

Then, we have the following three control rules ($q = 1$):

$R^{-1}$: If $s$ is $A^{-1}$, then $u$ is $G_{-1} z + k_{-1}$.

$R^{0}$: If $s$ is $A^{0}$, then $u$ is $G_0 z + k_0$.

$R^{1}$: If $s$ is $A^{1}$, then $u$ is $G_1 z + k_1$.

where the membership functions with respect to the fuzzy sets NB, ZO, and PB are
shown in Fig. 1. In this paper, the design parameters of the FSMC are selected as
follows:

$\sigma_1 = 0.12, \quad k_{-1} = 25, \quad k_0 = 0, \quad k_1 = -25$

$G_{-1} = G_0 = G_1 = [2.3955\ \ {-8.4554}\ \ 3.0288\ \ 5.595]$

Fig. 1. The membership function of fuzzy sets.
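
The three-rule controller of this subsection can be sketched in Python with the parameter values quoted above. Several details are assumptions of this example and are not stated in the paper: the sets NB and PB are taken to saturate at full membership beyond $\pm\sigma_1$, the rule outputs are combined by a weighted average of their firing strengths, and the function names are hypothetical.

```python
SIGMA1 = 0.12
K = {-1: 25.0, 0: 0.0, 1: -25.0}          # k_-1, k_0, k_1
G = [2.3955, -8.4554, 3.0288, 5.595]      # G_-1 = G_0 = G_1 in this subsection
C = [1.0, 1.0, 1.0, 1.0]                  # sliding vector c of Section 4.1

def memberships(s, sigma1=SIGMA1):
    """Degrees of NB (j=-1), ZO (j=0) and PB (j=1) for the switching value s."""
    x = max(-1.0, min(1.0, s / sigma1))   # clamp so that NB and PB saturate
    return {-1: max(0.0, -x), 0: 1.0 - abs(x), 1: max(0.0, x)}

def fsm_control(z):
    """Blend the three rule outputs G_j z + k_j by their firing strengths."""
    s = sum(ci * zi for ci, zi in zip(C, z))
    mu = memberships(s)
    u_eq = sum(gi * zi for gi, zi in zip(G, z))
    # The memberships sum to one, so the weighted average is a plain weighted sum.
    return sum(mu[j] * (u_eq + K[j]) for j in (-1, 0, 1))
```

Inside the band $|s| < \sigma_1$ the switching term is interpolated linearly between $k_{-1}$ and $k_1$ instead of jumping, which is how the fuzzy layer reduces chattering compared with the pure sign function.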



4.3 Design of FSMC Based on GAs 

The design parameters of the FSMC based on the GAs associated with the above
three control rules ($q = 1$) are specified as follows: sampling time interval = 0.02 sec,
population size = 50, initial crossover probability = 0.9, initial mutation probability
= 0.1, bit length per parameter = 16, generations = 60, and the search ranges of $\sigma$, $k$, $c$ and $G$
are [0,1], [-25,25], [0,15] and [-15,15], respectively. The optimal parameters of the FSMC are generated
after 60 generations, namely,

$c = [3.381\ \ 2.157\ \ 7.670\ \ 4.671]$

$\sigma_1 = 0.0316, \quad k_{-1} = 24.6, \quad k_0 = 931, \quad k_1 = -22.74$

$G_{-1} = [-12.750\ \ 9.305\ \ 5.765\ \ {-9.857}]$

$G_0 = [3.208\ \ {-1.4712}\ \ {-4.643}\ \ {-1.4772}]$

$G_1 = [5.472\ \ 2.361\ \ {-6.706}\ \ {-1.2082}]$



4.4 Results and Discussion 

The closed-loop system responses using the Fuzzy Sliding Mode (FSM) and Fuzzy
Sliding Mode Based on Genetic Algorithms (FSMBGA) controllers are shown in Fig. 2
(where $y_d$ is the desired reference). As can be seen, the response
using FSMBGA is better, with less error, than that of FSM. Also, the tracking time, Tt,
for FSMBGA is 1.879 sec while it is 4.43 sec for FSM. Thus, employing
FSMBGA provides a faster tracking response than
employing FSM. Generally, a faster response requires more control
effort. It is inherent in the present controllers that a fast tracking response is associated
with larger peak values of the control force.

In the other simulation, the performance of the controllers is evaluated under
parameter variations. Assume that the system is now subject to a
varying payload with the mass $m_2$ within the range between $m_{2\min} = 0.5$ kg
and $m_{2\max}$. The control responses obtained by employing the controllers at
$m_2 = 2$ kg are shown in Fig. 3. As can be seen, in this case the FSMBGA performs
better and faster than FSM.



5 Conclusion 

In this paper, the FSMBGA controller for tracking control of a two-degree-of-freedom
robot manipulator has been developed. First, the FSM controller is introduced.
Designing an equivalent control and a hitting control gives the membership functions
of the consequent part. The membership functions of the antecedent part are defined to meet
the stability requirement. Secondly, an FSM controller is developed through the GAs, i.e.
we design the optimal parameters of the FSM without any expert knowledge.
Simulation results of the system with the proposed FSM and FSMBGA controllers
showed that FSMBGA has a better and faster response than FSM. Also, the robustness
of the FSMBGA controller against parameter uncertainty was better than that of the
FSM one.

Fig. 2. Closed-loop system response using FSM and FSMBGA controllers.

Fig. 3. Closed-loop system response using FSM and FSMBGA controllers with $m_2 = 2$ kg.



Abstract. This paper describes how collective behavior can be
achieved using simple mechanisms based on local information and 
low-level cognition. Collective behavior is modeled and analyzed from 
the spatial point of view. Robots have a set of internal tendencies, such 
as association and repulsion, that enable them to interact with other 
robots. Each robot has a space around its body that represents a piece 
of the puzzle. The robots’ goal is to find other pieces of the puzzle, 
associate with them and remain associated for as long as possible. 
Experiments on queuing using this puzzle-like mechanism are analyzed. 

Keywords: Collective robotics, spatial coordination, proxemics. 



1 Introduction 

This research focuses on the design of collective behavior for autonomous mobile 
robots based on simple mechanisms. By simple mechanisms we denote those that 
depend on local information and low-level cognition, i.e. mechanisms used by
robots that have limited capabilities and limited knowledge of the environment.
The kind of collective behavior we are interested in involves the arrangement and
maintenance of spatial patterns by a multi-robot system, such as formation and
flocking. These behaviors are useful for a number of applications such as material 
transportation, hazardous material handling and terrain coverage. The central 
idea of our work is to implement collective behavior using a domain independent 
mechanism [4]. 

The paper presents proxemic coordination, a situated distributed method
for collective problem solving. It is situated because it relies mainly on the
information perceived by robots, instead of on a description of the environment. As
robots determine their actions locally, the model is distributed. This mechanism
is based on the spatial coordination of a group of robots in approximate patterns.
For that, each robot has a space around its body called the proxemic space, which
should be preserved from contact with objects and kin. Robots also have a set
of internal tendencies, such as association and repulsion, that enable them to
interact with others. A robot looks for its kin in order to associate with them
and remain associated for as long as possible. But if the internal tendencies are
modified, it can then avoid its kin. The association is performed by assembling
the frontiers of proxemic spaces. In contrast, repulsion means that robots move
away from each other. Robots and their territories are the pieces of a puzzle whose shape
is defined by the application: a column for queuing or a square for formations.

The paper is organized as follows: section 2 addresses related work. Section 
3 presents our proposal. Sections 4 and 5 describe various experiments and give 
some technical details. Section 6 examines results, and section 7 discusses con- 
clusions and perspectives. 



2 Related Work 

Spatial coordination in groups of agents has mainly been studied in simulation 
[3,14]. Interesting experiments where several agents [6] or robots [9] coordinate 
and synchronize their movements have also been reported. Spatial organization 
and flocking have been largely studied in the literature and a variety of simulated 
experiments has been presented [12,14]. 

Social potential fields [15] are very close to our research. In this approach, artificial force
laws between robots producing both attraction and repulsion are defined. The
method is robust and efficient, but it relies on a certain amount of global information
and on direct communication between robots. Thus, robots have to exchange
their absolute positions in order to perform force calculations and to coordinate
their movements. The method has been applied to model collective behaviors
such as clustering and guarding, but only using simulated robots. This is due to
the difficulty of calculating social potential fields in real time.

A similar method using motor schemas instead of force laws has also been
explored [1]. In this approach, modular behaviors are composed in order to achieve
spatial formations such as lines, columns and diamonds. The method has been
tested using simulated and physical robots. More recent results of this work [2]
describe robots with attachment sites on their bodies that determine the spatial
structure formed by the robots.

Self-assembling robots are a very active research avenue in the robotics
community [10]. Mechanisms composed of simple robots that are able to adopt various
shapes and reconfigure themselves have been physically built [5,17] and
simulated [13]. The design and implementation of these systems, which combine a
lot of computing and engineering skills, are beyond the goals of our current
research. However, the general principle that enables robot-pieces to attract and
assemble has inspired us to propose a puzzle-like mechanism to display collective
behavior in robotics.

The research reported in this paper is similar to social potential fields and
to spatial formation with attachment sites. Our approach is different from the
first in the application of repulsion and attraction forces. Whereas in the work
mentioned these forces are used to avoid obstacles and guide robots to the goal,
in ours they are directed to the dynamic spatial coordination of robots. Our work
is also different in that our robots do not exchange information about their
positions. In contrast with the second method, we use hierarchical behaviors instead
of motor schemas, and the attachment sites of our robots are not fixed. These sites
can be modified during an experiment, resulting in a different global behavior.



3 Proposal 

Coordination is at the core of puzzle-like mechanisms. But whereas self-assembly
demands a lot of exactness, queuing and formations are less demanding as to
the precision of the movements executed by participants. From a global point
of view, the area formed by a group of robots coordinating their movements
can be considered as a shape that the robots are trying to preserve. This shape is
of course not fixed; it can be reconfigured according to the rules of assembly
applied by the robots.

We propose a method to coordinate approximately the movements of a group 
of mobile robots that have to organize themselves spatially. Neither direct com- 
munication nor centralized control is involved in this strategy. Instead, a robot 
must be able to delimit a space around itself, a proxemic space, and perceive its 
kin. Both capabilities are necessary to define the internal tendencies of robots. 
This method is called proxemic coordination. 

The notions of proxemic space, kin recognition and internal tendency are 
discussed below. 

3.1 Proxemic Space 

Proxemics is a notion introduced by the anthropologist Edward Hall [7].
According to him, space plays an essential role in social systems. Individuals
define and organize the space around their bodies. Their behavior is then closely
related to the interactions perceived within this space.

The idea that an individual moves surrounded by a bubble, and that the
bubble arrangement influences his behavior, inspired us to propose a method
of proxemic coordination for a group of robots. For us, the proxemic space is a
virtual space defined by a robot as an extension of its body.

3.2 Kin Perception 

Kin recognition is a requirement for the generation of complex behavior in groups
of robots [8]. Spatial coordination often needs mechanisms of kin perception and
recognition. Robots should be able to distinguish not only their environment,
but also their kin.

Being able to recognize their kin is considered a basic skill for applying proxemic
coordination. In order to recognize each other, robots may use a set of features,
such as colors or visual cues, to distinguish between their kin and other elements
of the environment. These mechanisms are, as we see below, based on the robots'
local perception.

3.3 Tendencies

In addition to the ability to delimit a proxemic space, robots must be able to
assemble their spaces in specific manners. As in puzzles, where the shape of the
pieces determines how they are assembled, the robots have attachment labels.
These labels may be viewed as plus (+) and minus (-) polarities that are used
by robots to connect their spaces.

Robots are guided by two internal tendencies: association and repulsion.
Association is defined as the attachment of opposite labels (+/- or -/+), and
repulsion as the avoidance of similar labels (+/+, -/-).

Attachment labels may also change during an experiment according to
specific situations. A robot with a flat battery taking part in a formation, for instance,
has the right to change its polarities and withdraw from the collective formation.
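
The two tendencies reduce to a comparison of polarities, as the following Python sketch illustrates. The `Robot` class, the site names `head` and `rear`, and the rule that a withdrawing robot flips all of its labels are hypothetical choices made for this example.

```python
from dataclasses import dataclass, field

def tendency(label_a, label_b):
    """Opposite labels associate, similar labels repel."""
    return "association" if label_a != label_b else "repulsion"

@dataclass
class Robot:
    # Hypothetical layout: one polarity label per attachment site.
    labels: dict = field(default_factory=lambda: {"head": "+", "rear": "-"})

    def withdraw(self):
        """Flip every label, e.g. when the battery is flat (illustrative rule)."""
        self.labels = {site: "+" if lab == "-" else "-"
                       for site, lab in self.labels.items()}
```

After `withdraw()`, a follower whose head label is plus meets a plus label on the rear of the withdrawing robot, so the pair switches from association to repulsion without any message being exchanged.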



4 Simulated Experiments 

We have programmed a virtual environment using Starlogo©¹ in order to
implement our proposal. The rules of proxemic coordination are used in this
experiment to enable a group of robots to reach a global formation. The robots' goal
is to form a line in front of a predefined landmark, but the landmark location is
not known by the robots.



4.1 Proxemic Space 

The proxemic space is a rectangle in front of a robot. A robot can perceive whether this
one-unit space is clear or not. The proxemic space is clear if neither a
robot nor a border of the environment is perceived there. This condition is used
by robots in order to decide which behavior to execute.

The robots are able to execute three behaviors: wander, avoid obstacles
and queue. A robot can move 0.1 steps forward in the direction that it is facing
and turn right by 5 degrees.



4.2 Kin Perception 

A robot can detect if another robot is located one unit directly in front of its 
proxemic space. If a partner is detected, the robot can also detect its orientation. 
The partner orientation is useful to determine if an attachment label is “visible” 
or not. 

¹ Starlogo is a programming environment for decentralized systems, developed at the
Massachusetts Institute of Technology, available at:
http://education.mit.edu/starlogo/

Fig. 1. Number of robots in formation in two groups of robots during 60 sec. The
experiments were performed using subgroups of 5, 10 and 15 robots that were randomly
located in the environment. The attachment labels of the first group (a) were fixed
whereas those of the second group (b) were modified during experiments. The second
group managed to solve conflicts when two robots are attracted by the same third robot.



4.3 Tendencies 

Two attachment labels were defined. A plus label was situated on the robot's
head and a minus label on the robot's rear. Robots are attracted by minus attachment
labels and by the landmark indicating the beginning of the line.

Depending on the distance between attachment labels, robots assemble their
proxemic spaces in three different manners: approximate, standard and accurate.
That is, a robot can be attracted by an attachment label situated, respectively,
±20°, ±10° and ±5° away from its own attachment label. The quality of a
formation depends on this assembling precision.
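
The assembling precision can be read as a tolerance on the angular offset between two attachment labels. The Python sketch below encodes the three tolerances given above; the function names and the convention of wrapping offsets into [-180, 180) degrees are assumptions of this example.

```python
TOLERANCE_DEG = {"approximate": 20.0, "standard": 10.0, "accurate": 5.0}

def angular_offset(a_deg, b_deg):
    """Smallest signed difference b - a, wrapped into [-180, 180)."""
    return (b_deg - a_deg + 180.0) % 360.0 - 180.0

def is_attracted(own_label_deg, other_label_deg, precision):
    """True if the other label lies within the tolerance of the chosen precision."""
    return abs(angular_offset(own_label_deg, other_label_deg)) <= TOLERANCE_DEG[precision]
```

A label that is 15° away attracts a robot working with approximate precision but not one working with standard or accurate precision, so tighter tolerances should give straighter but slower-forming lines.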

This tendency enables robots to follow one another and to form breakable lines,
that is, lines that are easily disrupted when an object is perceived within the
proxemic space, which often happens in an environment where various robots
are wandering (see figure 1(a)). In order to remain still and form static lines,
robots change their attachment labels once they are assembled. This modification
acts as a blocking mechanism that protects formations and contributes to resolving
conflicts (see figure 1(b)).

Figure 2 shows some snapshots of our system. As we can see, the robots
did not form fine lines, but they reached global formations using only local
information.



5 Physical Experiments 

We have used the rules of proxemic coordination in order to enable a group of
physical mobile robots to form a line in front of a landmark. The experiments
described in this section were conducted using three Pioneer 2-DX mobile robots
from ActivMedia©, equipped with odometers, bumpers, sonars, radio modems,
video cameras and an onboard computer.

Fig. 2. Two snapshots of our system with 10 robots that were randomly placed. Robots
assembled their proxemic spaces using approximate (a) and accurate (b) precision. The
location and the state of the robots at three different instants are also shown. During
these experiments the attachment labels changed once robots were assembled.

5.1 Proxemic Space 

A robot uses two sonar arrays to delimit its proxemic space. Based on sonar 
readings, the robot can determine whether or not its proxemic space is clear 
(see figure 3). The sonars’ sensitivity ranges from 10 centimeters to more than 5 
meters and can be adjusted in order to see small objects at great distances. The 
sensors’ sensitivity determines the possible limits of proxemic spaces. 
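
As an illustration of how sonar readings could decide whether the proxemic space is clear, here is a minimal sketch. It assumes that each reading is a (bearing, range) pair in the robot frame, with bearing 0° pointing straight ahead, and that the proxemic space is a rectangle of given length and width centred on the heading. The function names and this geometry are assumptions made for the example, not the implementation used on the Pioneer robots.

```python
import math


def nearest_in_proxemic_space(readings, length, width):
    """Return (dx, dy) of the nearest sonar echo inside the rectangle, or None.

    readings: iterable of (bearing_deg, range_m) pairs in the robot frame.
    The rectangle spans 0 <= x <= length and |y| <= width / 2 (assumed layout).
    """
    nearest = None
    for bearing_deg, rng in readings:
        x = rng * math.cos(math.radians(bearing_deg))
        y = rng * math.sin(math.radians(bearing_deg))
        if 0.0 <= x <= length and abs(y) <= width / 2.0:
            if nearest is None or math.hypot(x, y) < math.hypot(*nearest):
                nearest = (x, y)
    return nearest


def is_clear(readings, length, width):
    """The proxemic space is clear when no echo falls inside the rectangle."""
    return nearest_in_proxemic_space(readings, length, width) is None
```

For example, `is_clear([(0, 3.0), (45, 2.0)], 1.0, 0.8)` holds because both echoes lie beyond a one-metre rectangle, while an echo at 0.5 m straight ahead would be reported with dx = 0.5 and dy = 0.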

Robots are able to execute four behaviors: wander, avoid obstacles, 
queue and adjust position. The last behavior enables robots to assemble the 
frontiers of their proxemic spaces as accurately as possible. 



Fig. 3. A robot is equipped with 2 sonar arrays of 8 sonars each (left). In our 
experiments, proxemic space is a virtual rectangle situated ahead of the robot (right). 
Sonars are used to calculate the distances along x, dx, or along y, dy, to the nearest 
object within this rectangle. 

Fig. 4. An environmental landmark (right) and a robot with a cylinder covered 
by a visual cue set horizontally (left). 



5.2 Kin Perception 

We have developed a cue-based recognition system in order to visually 
differentiate the environment and the kin using the CCD cameras of the robots [16]. 
A cue consists of black bars on a white background. This system is based on the 
recognition of two kinds of cues: environmental landmarks and robot cues. The 
former distinguish elements of the environment, such as objects and walls; the 
latter are worn by the robots as identity cues (see figure 4). 

Recognition is based on a picture analysis method that we call the railroad 
method. This name comes from the following analogy: when someone who is 
wandering comes upon a railroad, he follows it. As he follows the rails, he counts 
the railway sleepers. Similarly, in our method the pictures are scanned and, if a 
succession of three black bars is encountered (the rails), we follow their direction 
and count the bits (the railway sleepers). For details see [16]. 

This system can obtain the position of the cues in the picture, their 
identifiers, the distance between the cues and the camera, and their orientation. 
Additionally, the system enables us to distinguish correctly eight angles of 
identity cues worn by robots in movement (0°, 45°, ..., 315°). These angles are used 
to put virtual attachment labels on our robots. 



5.3 Tendencies 

As in the simulated line-formation experiments, two attachment labels were defined: 
a plus label at the angle 0° and a minus label at the angle 180° of identity cues. 
Robots are attracted by minus labels and by a specific environmental landmark 
indicating the beginning of the line. 

Once a robot has assembled its proxemic space, it changes its attachment 
labels in order to remain still. Since attachment labels are virtual, robots commu- 
nicate and update the state of their labels continually.^ 

Figure 5 illustrates the environment and the robots in action. Figure 6 illus- 
trates the trajectories followed by robots during this experiment. 




Fig. 5. Snapshots of three mobile robots wandering in a corridor and assembling their 
proxemic spaces in line. 



^ Although physical robots communicate, this communication is not intended to 
improve their coordination efforts, but to inform about the polarities of their proxemic 
spaces. 

Fig. 6. Trajectories followed by three physical robots that form a line using the method 
of proxemic coordination. 



6 Discussion 

Reproducing simulated experiments using physical robots is a major challenge 
for roboticists. Most of the rules used in simulation are poorly situated and 
are only useful for idealized robots that are equipped with perfect sensors and 
actuators, and are able to perform actions without fault. 

Our method of proxemic coordination is well adapted for solving problems 
of spatial coordination using simulated and physical robots. As we could see, 
robots reached global spatial structures taking into account local information. 

Robots using the rules of proxemic coordination depend mainly on their 
sensors in order to operate. If the landmark indicating the beginning of the line 
is moved, for example, robots are able to break up the formation and to re-form 
it in front of the new location. 

Proxemic coordination is not intended to solve problems of coordination in- 
volving fine assembling, but it has been applied to a number of collective robotics 
applications, such as collective box-pushing and dynamic formations (for prelimi- 
nary results see [11]). 

7 Conclusions and Perspectives 

In this paper we have described proxemic coordination, a simple method to ena- 
ble mobile robots to coordinate their movements in order to reach a global spatial 
formation. Spatial coordination is important for solving problems of collective 
robotics such as material transportation and terrain coverage. 

The experiments described are in progress and future work will focus on 
defining more flexible proxemic spaces. We are working on various shapes of 
proxemic spaces, as well as on resizable proxemic spaces. 

We are also using the method of proxemic coordination in the design of the 
behavior of self-assembling robots. 



Abstract. Probabilistic roadmap methods (PRM) have been successfully 
applied in motion planning for robots with many degrees of freedom. 
Many recent PRM approaches have demonstrated improved performance 
by concentrating samples in a nonuniform way. This work replaces 
random sampling with deterministic sampling. We present several 
implementations of PRM-based planners (multiple-query, single-query and 
Lazy PRM) and lattice-based roadmaps. Deterministic sampling can be 
used in the same way as random sampling. Our work can be seen as 
an important part of the research in the uniform sampling field. 
Experimental results show performance advantages of our approach. 



1 Introduction 

The complexity of motion planning for robots with many degrees of freedom 
(more than 4 or 5) has led to the development of computational schemes that 
attempt to trade off completeness against time. One such scheme, randomized 
planning, avoids computing an explicit geometric representation of the free space 
$\mathcal{F}$. Instead, it uses an efficient procedure to compute distances between bodies 
in the workspace. 

It samples the configuration space (CS) by selecting a number of configurations 
at random and retaining only the free configurations as nodes. It then 
checks if each pair of nodes can be connected by a collision-free path in 
configuration space. This computation yields the graph G = (V, E), called a probabilistic 
roadmap [1], where V is the set of nodes and E is the set of pairs of nodes that 
have been connected. 

The default sampling approach for PRM planners samples $\mathcal{F}$ in a random 
way. Samples from the uniform distribution are usually obtained by generating 
so-called pseudo-random numbers. Random sampling often generates clusters 
of points; in addition, gaps appear in the sample space. Recent research has 
focused on designing efficient sampling and connection strategies [2], [3], [4], [5]. 

Recent works replace random sampling with deterministic sampling [6], [7]. 
The work presented in [7] for non-holonomic motion planning proposes the use 
of other sequences: Sobol', Faure, Niederreiter and generalized Halton. LaValle 
and Branicky [6] use only two low-discrepancy sequences, Halton and 
Hammersley. Sampling is seen as an optimization problem in which a set of points 
is chosen to optimize some criterion of uniformity. Deterministic sampling 
can also be thought of as a sophisticated form of stratified sampling. 



2 Deterministic Sampling 



The historical origin of discrepancy theory is the theory of uniform distribu- 
tion developed by H. Weyl and other mathematicians in the early days of the 
20th century. While the latter deals with the uniformity of infinite sequences 
of points, the former deals with the uniformity of finite sequences. Finite se- 
quences always have some irregularity from the ideal uniformity due to their 
finiteness. Discrepancy is a mathematical notion for measuring such irregularity. 
Let $X = [0,1]^d \subset \mathbb{R}^d$ define a space over which to generate samples. Consider 
designing a set, P, of n d-dimensional sample points $\{x_1, \ldots, x_n\}$ in a way that 
covers X. Let $\mathcal{R}$ be a collection of subsets of X, called a range space. Let $R \in \mathcal{R}$ 
denote one such subset. The formal definition of discrepancy is as follows: 

$$D_n(P, \mathcal{R}) = \sup_{R \in \mathcal{R}} \left| \frac{|P \cap R|}{n} - \lambda(R) \right| , \qquad (1)$$

where $\lambda$ denotes the Lebesgue measure on X and the supremum is taken over 
all axis-parallel boxes R. A detailed analysis of the discrepancy can be found in 
[8] and in the references therein. 

Similarly to the notion of discrepancy, it is possible to quantify the denseness 
of n points, the dispersion, which is defined by 

$$d_n(P, \delta) = \max_{x \in X} \min_{1 \le i \le n} \delta(x, x_i) . \qquad (2)$$

It was introduced by Hlawka (1976) and later investigated in more general 
form in [8]. Above, $\delta$ denotes any metric. For any particular range space, the 
dispersion is clearly bounded by the discrepancy. The relation 

$$d_n(P, \delta) \le \sqrt{d}\, D_n(P, \mathcal{R})^{1/d} \qquad (3)$$

is established in [8] for the Euclidean metric. For the maximum metric one 
obtains 

$$d'_n(P, \delta) \le D_n(P, \mathcal{R})^{1/d} \qquad (4)$$

according to [8]. Thus every low-discrepancy point set (or sequence) is a 
low-dispersion point set (or sequence), but not conversely. 
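
To make the definition of dispersion concrete, the sketch below evaluates it exactly in the simplest case, d = 1 with X = [0, 1] and the usual distance on the line. The function name `dispersion_1d` is hypothetical. In one dimension the point of X farthest from all samples is either an end of the interval or the midpoint of the largest gap, so no search over X is needed.

```python
def dispersion_1d(points):
    """Exact dispersion of a finite point set in X = [0, 1] (d = 1).

    The farthest location from all samples is either an end of the interval
    or the midpoint of the largest gap between consecutive samples.
    """
    xs = sorted(points)
    if not xs:
        raise ValueError("need at least one point")
    worst = max(xs[0], 1.0 - xs[-1])
    for left, right in zip(xs, xs[1:]):
        worst = max(worst, (right - left) / 2.0)
    return worst
```

Two points at 0.25 and 0.75 give a dispersion of 0.25, whereas two points at 0 and 1 give 0.5: the first placement leaves no location of X farther than a quarter of the interval from a sample.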

Although dispersion has been given less consideration in the literature than 
discrepancy, it is more suitable for motion planning. Dispersion was 
developed to bound optimization error; however, in PRM-based planners, it 
can be used to ensure that any corridor of a certain width will contain sufficient 
samples [6]. 

2.1 Finite Point Sets and Sequences 

For practical use, there are three different types of low-discrepancy sequences or 
point sets: Halton sequences, lattice rules, and (t, s)-sequences. The last type 
includes almost all important sequences, such as Sobol' sequences, Faure sequences 
and Niederreiter-Xing sequences, among others. 

Many of the relevant low-discrepancy sequences are linked to the van der 
Corput sequence. The Halton sequence is a d-dimensional generalization that 
uses van der Corput sequences of d different bases, one for each coordinate. 
The Hammersley point set is an adaptation of the Halton sequence, using only 
d - 1 distinct primes. A different approach was used by Sobol', who suggested 
a multi-dimensional (t, s)-sequence using base 2. Sobol's idea was further 
developed by Faure, who suggested alternative multi-dimensional (t, s)-sequences 
with base b > d. A general construction principle for (t, s)-sequences has been 
proposed by Niederreiter. We refer to [6], [7], [8] and [9] for the construction of 
low-discrepancy sequences. 
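
The relation between the van der Corput, Halton and Hammersley constructions can be sketched in a few lines. This is an illustrative implementation under the usual textbook definitions (radical inverse of the index, one prime base per coordinate, and i/n as the first Hammersley coordinate); the function names are chosen for this example and are not those of the planners' code.

```python
def van_der_corput(i, base=2):
    """i-th element (i = 0, 1, 2, ...) of the van der Corput sequence.

    The digits of i in the given base are mirrored about the radix point.
    """
    value, denom = 0.0, 1.0
    while i > 0:
        i, digit = divmod(i, base)
        denom *= base
        value += digit / denom
    return value


def halton(i, bases=(2, 3)):
    """i-th Halton point: one van der Corput coordinate per distinct prime base."""
    return tuple(van_der_corput(i, b) for b in bases)


def hammersley(n, bases=(2,)):
    """Hammersley set of n points in d = len(bases) + 1 dimensions.

    The first coordinate is i/n; the remaining d - 1 coordinates use
    van der Corput sequences, hence only d - 1 primes are needed.
    """
    return [(i / n,) + halton(i, bases) for i in range(n)]
```

In base 2 the indices 1, 2, 3, 4 map to 0.5, 0.25, 0.75, 0.125: each new point falls into the largest remaining gap, which is the property that the multi-dimensional constructions try to preserve.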

In general, slightly better distributions are possible if the number of sample 
points needed is known in advance. One effective way of obtaining such sets is 
to use the lattice point method. Good lattice point sets are an important kind 
of low-discrepancy points for multi-dimensional quadrature, simulation, 
experimental design, and other applications [10]. 

Let $n \ge 2$ be an integer and $a = (a_1, \ldots, a_d)$ be an integer vector modulo n. 
A set of the form 

$$P_n = \left\{ \{ak/n\} = (\{a_1 k/n\}, \ldots, \{a_d k/n\}) : k = 0, 1, \ldots, n-1 \right\} \qquad (5)$$

is called a lattice point set, where {x} denotes the fractional part of x. The 
vector a is called a lattice point or generator of the set. 
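
A lattice point set of this form is short to program. The sketch below assumes that k runs over 0, 1, ..., n - 1 and takes the fractional part coordinate by coordinate; the function name `lattice_point_set` and the example generator are illustrative only. The generator a = (1, 5) with n = 8 is a two-dimensional Fibonacci lattice, used here simply because it is small.

```python
def lattice_point_set(n, a):
    """Points {a*k/n} for k = 0, ..., n-1, with {.} applied per coordinate.

    Reducing a_j * k modulo n in integer arithmetic before dividing gives
    the fractional part without accumulating rounding error.
    """
    return [tuple(((aj * k) % n) / n for aj in a) for k in range(n)]
```

For instance, `lattice_point_set(8, (1, 5))` starts with (0, 0), (0.125, 0.625), (0.25, 0.25) and contains 8 distinct points.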

The advantage of the lattice point set, as defined in (5), is that it has a simple 
form and is easy to program. A disadvantage is that the number of points, n, is 
fixed and the good lattice points, a, typically depend on n. This is in contrast 
to (t, m, s)-nets, which can be extended in size by drawing them from a (t, s)-sequence 
[8]. This deficiency in lattice point sets can be overcome by replacing 
k/n in (5) by the van der Corput sequence. Tables of a for extensible lattice point 
sets are given by Hickernell et al. [11]. 

We considered a particular type of lattice called the Sukharev grid, which is 
constructed for some n such that $k = n^{1/d}$ is an integer. X is decomposed into n 
cubes of width 1/k so that a tiling of k × k × ··· × k is built. The classical grid 
places a vertex at the origin of each region, whereas the Sukharev grid places a 
vertex at the center of each region. 
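
The difference between the two grids is only where the vertex sits inside each cube, as the following sketch shows. The function names are hypothetical; both functions assume X = [0, 1]^d tiled by k^d cubes of width 1/k.

```python
import itertools


def sukharev_grid(k, d):
    """Centers of the k**d cubes of width 1/k that tile [0, 1]^d."""
    centers = [(i + 0.5) / k for i in range(k)]
    return list(itertools.product(centers, repeat=d))


def classical_grid(k, d):
    """Same tiling, with the vertex at the origin (lowest corner) of each cube."""
    corners = [i / k for i in range(k)]
    return list(itertools.product(corners, repeat=d))
```

With the vertices at the cube centers, no location of X is farther than 1/(2k) from a sample in the maximum metric, whereas the classical grid leaves the far faces of X at distance 1/k from the nearest vertex.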

2.2 Uniformity of Low-Discrepancy Sequences 

For numerical integration and many other purposes, Monte Carlo methods 
have been used for a long time. Newer developments replace the pseudo-random 
sequences by deterministic sequences (quasi-random or low-discrepancy 
sequences). Although motion planning is a different problem from numerical 
integration, it is worth evaluating these carefully constructed and well analyzed 
samples. Their potential use in motion planning is no less reasonable than using 
pseudo-random sequences, which were also designed with a different intention in 
mind. 

We generated points of the sequences up to d = 100 and n = 100000 and 
observed their uniformities in two-dimensional planes selected at random, using 
Halton, Faure, Sobol’ and Niederreiter sequences. 

The Halton sequences give uniform distributions for lower dimensions (1-7). 
As the number of dimensions increases, the quality of this sequence rapidly 
decreases because two-dimensional planes within the hypercube are sampled in 
cycles with increasing periods. For dimensions larger than 8 the sample points 
generated by the Halton sequence are ordered into lines. To avoid this line ordering, 
we used the generalized Halton sequences. The specific characteristics of the 
generalized Halton sequences are still unknown (a certain degree of local 
non-uniformity is introduced). 

The Faure sequence is an example of a less successful (t, s)-sequence that 
gives a different distribution. While it achieves a high degree of local uniformity, 
the unit square projections of the hypercube are sampled in strips, and the new 
points fall into the vicinity of those generated previously. 

The Sobol’ sequence preserves its uniformity as d increases. 

The Niederreiter sequences are (t, s)-sequences defined for any base b > d, 
where b is a power of a prime number; they make only a slight improvement on 
the Faure sequences in certain dimensions. 

These experiments consolidate two ideas: 1) With a larger base, a low- 
discrepancy sequence can present certain pathologies, and 2) the minimal size 
of a low-discrepancy sample that has better equidistribution properties than a 
pseudo-random sequence grows exponentially with the dimension [9] . 




Fig. 1. Projection of the first 1000 points of the Halton (d = 8) and Faure (d = 50) 
sequences. 

3 Deterministic Roadmap Methods 

In the last few years, the PRM approach has been studied by many researchers 
[1], [2], [3], [4], [14]. This has led to a large number of variants of this approach, 
each one with its own merits. It is difficult to compare PRM variants, because it is 
hard to maintain congruence when each variant has been tested on different types 
of scenes, has used different underlying libraries, and has been executed on 
different machines. 

The philosophy behind the classic PRM was to perform preprocessing (the 
learning phase) so that multiple queries for the same environment could be 
handled efficiently. Once the PRM has been constructed, the query phase attempts 
to solve motion planning problems: qi and qg are treated as new nodes in the 
PRM, and connections are attempted. Then, standard graph search is performed 
to connect qi to qg. If the method fails, then either more nodes are needed in 
the PRM, or there is no solution. 
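
The two phases can be illustrated with a small deterministic roadmap in the unit square. The sketch below is not the MSL-based implementation used in the experiments: the scene (one box obstacle), the connection radius, the fixed-step segment check and the use of breadth-first search as the "standard graph search" are all assumptions made for the example, and the samples come from a two-dimensional Halton sequence with bases 2 and 3.

```python
import math
from collections import deque

# Hypothetical scene: unit square with one axis-parallel box obstacle.
OBSTACLE = (0.4, 0.4, 0.6, 0.6)  # (xmin, ymin, xmax, ymax)


def van_der_corput(i, base):
    """Radical inverse of the integer i in the given base."""
    value, denom = 0.0, 1.0
    while i > 0:
        i, digit = divmod(i, base)
        denom *= base
        value += digit / denom
    return value


def is_free(q):
    xmin, ymin, xmax, ymax = OBSTACLE
    return not (xmin <= q[0] <= xmax and ymin <= q[1] <= ymax)


def segment_free(p, q, step=0.01):
    """Check the straight segment p-q at evenly spaced points (resolution `step`)."""
    n = max(1, int(math.dist(p, q) / step))
    return all(is_free((p[0] + (q[0] - p[0]) * t / n,
                        p[1] + (q[1] - p[1]) * t / n)) for t in range(n + 1))


def build_roadmap(n_samples, radius):
    """Learning phase: Halton samples, keep the free ones, link nearby pairs."""
    samples = ((van_der_corput(i, 2), van_der_corput(i, 3))
               for i in range(1, n_samples + 1))
    nodes = [q for q in samples if is_free(q)]
    edges = {q: [] for q in nodes}
    for i, p in enumerate(nodes):
        for q in nodes[i + 1:]:
            if math.dist(p, q) <= radius and segment_free(p, q):
                edges[p].append(q)
                edges[q].append(p)
    return edges


def query(edges, q_init, q_goal, radius):
    """Query phase: add q_init and q_goal as nodes, then breadth-first search."""
    graph = {q: list(nbrs) for q, nbrs in edges.items()}
    for q in (q_init, q_goal):
        graph.setdefault(q, [])
        for p in list(graph):
            if p != q and math.dist(p, q) <= radius and segment_free(p, q):
                graph[q].append(p)
                graph[p].append(q)
    parent, frontier = {q_init: None}, deque([q_init])
    while frontier:
        p = frontier.popleft()
        if p == q_goal:
            path = []
            while p is not None:
                path.append(p)
                p = parent[p]
            return path[::-1]
        for q in graph[p]:
            if q not in parent:
                parent[q] = p
                frontier.append(q)
    return None  # more nodes are needed, or no solution exists
```

A call such as `query(build_roadmap(200, 0.25), (0.1, 0.1), (0.9, 0.9), 0.25)` returns a list of configurations from the initial to the goal configuration. Because the samples are deterministic, repeated runs build the same roadmap, which is what makes comparisons between sampling sequences reproducible.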

The classic PRM [1] was chosen because it eases the comparison between its 
deterministic variants; the samples from the pseudo-random number generator 
appear directly as nodes in the roadmap (except those in collision). 

We can consider two variants of the PRM, a deterministic roadmap (DRM) 
and a lattice roadmap^ (LRM), obtained by applying the deterministic sampling 
techniques described in Section 2. 

The main problem with classical grid search is that too many points per axis 
are typically required. The PRM was proposed to reduce the exponential number 
of samples needed for this approach. If we generalize the grid to a lattice (which 
is essentially a nonorthogonal grid), grid search can be considered a special 
kind of lattice roadmap. 

We used the following scenes (see Figure 2) to compare different deterministic 
sampling methods. All techniques were integrated in the MSL library (University 
of Illinois), implemented in C++ under Linux, and use the PQP collision 
detection package from the University of North Carolina. All experiments were run 
on an 866 MHz Pentium 3 with 128 MB of internal memory. 

The number of nodes required to find a path that travels through the corri- 
dors is shown in Table 1 for all sampling strategies. We also compared two PRM 
variants, Gaussian and Visibility. 



Table 1. Comparisons of the number of nodes 

Prob.        PRM     DRM     LRM     GS     VS 
2 narrows    4316    3175    2322    700    35 
1 narrow     2480    1840    1239    656    39 


^ The costly neighborhood structure is implicitly defined by the lattice rules. 

Fig. 2. A 6-dof planning problem in which an object passes through two small openings, 
and an L-shaped object that must rotate to get through the hole. 

Fig. 3. An 8-dof mobile manipulator. 



Sanchez et al. [7] have proposed the use of low-discrepancy sequences in the 
nonholonomic motion planning context. Figure 3 presents an example for mobile 
manipulators. 

In all our experiments, pseudo-random numbers were generated using the 
linear congruential and Mersenne Twister generators for PRM. We have used 
Halton, generalized Halton, Hammersley, Faure, Sobol' and Niederreiter sequences 
as inputs for DRM. 

We chose the best result among all deterministic sequences for DRM. The 
construction time is smaller for the deterministic case; this is due to the 
fact that the number of nodes necessary to answer a query correctly is smaller. 
It is important to mention that in the case of very cluttered environments, the 
construction time is similar for both methods, PRM and DRM. For the number 
of generated arcs and the calls to collision checking, the difference between the 
two is enormous. 

Analyzing the obtained results, we can affirm that the use of deterministic 
sampling offers additional advantages: the number of nodes required to find a 
path is always smaller in the deterministic sampling case, and the number of 
collision test calls diminishes considerably, as does the number of configurations 
generated during the construction phase. All these results confirm that the coverage 
of the free space is better using deterministic sampling. 

These results also confirm that the deterministic sampling approach does not 
require an important adaptation of the classic PRM algorithm. 



4 Hsu and SBL Planners 



While multi-query planners use a sampling strategy to cover the whole free-space, 
a single-query planner applies a strategy to explore the smallest portion of free- 
space needed to find a solution path. For example, see the planners presented in 
[14] and [15]. 

The planner in [14] constructs two trees of nodes (the roadmap) rooted at qi 
and qg respectively. It samples new configurations first in the neighborhoods of 
qi and qg, and then, iteratively, in the neighborhoods of newly generated nodes. 
It stops as soon as the two trees become one connected component, and a path 
between qi and qg can be extracted from the roadmap. The current implementation 
of the algorithm uses a fixed-size neighborhood around an existing node to 
sample new configurations. The size of the neighborhoods has a big impact on the 
distribution of nodes. If the size is too small, the nodes tend to cluster around the 
initial and the goal configurations and leave large portions of the free space with 
no samples. If the size is very large, the samples are likely to be distributed more 
evenly in the free space, but the rejection rate also increases significantly. 

The planner in [15] searches $\mathcal{F}$ by building a roadmap made of two trees of 
nodes, Ti and Tg. The root of Ti is the initial configuration qi and the root 
of Tg is the goal configuration qg (bi-directional search). Every new node 
generated during planning is installed in either one of the two trees as the child of 
an already existing node. The link between the two nodes is the straight-line 
segment joining them in CS. This segment will be tested for collision only when 
it becomes necessary to perform this test to prove that a candidate path is 
collision-free (lazy collision checking). The planner is given two parameters: s, 
the maximum number of nodes that it can generate, and ρ, a distance threshold. 
In this implementation ρ is set between 0.1 and 0.2. 

These algorithms can be derandomized by using deterministic low-discrepancy 
sequences, such as generalized Halton, Sobol', Faure, or Niederreiter. 
We simply replace random samples with deterministic ones. Table 2 
shows the results of experiments performed on two difficult scenes for 
articulated robots. The planners were implemented in Java. 

The first results obtained with the derandomization of these algorithms are 
promising. Both the run time and the number of calls to collision checking are 
smaller when deterministic sampling is used. We also noticed that the intrinsic 
parameters of the algorithms are not easy to choose. 

Fig. 4. Two paths found by the Hsu and SBL planners for 11-dof and 8-dof manipulators. 
Table 2. Statistics for the two environments 

11-dof Prob.        Hsu         SBL        Hsu/DRM     SBL/DRM 
Nodes in roadmap    1156        6831       805         4416 
Nodes in path       37          61         42          65 
Running time        17.77       3.45       9.46        2.29 
Collision checks    56877456    5676136    28966459    4045502 

8-dof Prob.         Hsu         SBL        Hsu/DRM     SBL/DRM 
Nodes in roadmap    447         2343       413         2210 
Nodes in path       33          53         26          44 
Running time        3.27        0.84       2.99        0.78 
Collision checks    8893118     1560958    8097380     1277984 


5 Discussion 

Low-discrepancy samples were developed to perform better than random samples 
for numerical integration (using an inequality due to Koksma-Hlawka). 
Low-dispersion samples were developed to perform better than random samples in 
numerical optimization (using an inequality due to Niederreiter). We can obtain 
a bound that expresses the convergence rate in terms of dispersion and the width 
of the narrowest corridor in $\mathcal{F}$. The corridor thickness appears to be a measure 
of difficulty. 

We used low-discrepancy sequences to bridge the gap between the flexibility 
of pseudo-random number generators and the advantages of a regular grid. They 
are designed to have a high level of uniformity in multi-dimensional space, but 
unlike pseudo-random numbers they are not statistically independent. The trouble 
with the grid approach is that it is necessary to decide in advance how fine 
it should be, and all the grid points need to be used. It is therefore not possible 
to sample until some convergence criterion has been met. Recently, Lindemann 
and LaValle [16] proposed a new sampling method in the PRM framework. This 
grid sampling satisfies the desirable criteria of uniformity, lattice structure and 
incremental quality. It is an arbitrary-dimensional generalization of the van der 
Corput sequence. 

The results indicate an advantage of deterministic sampling, lattices, and 
grids over pseudo-random sampling, particularly in a lazy PRM approach. Classic 
PRM attempts to reduce the exponential number of samples needed for a 
grid-based approach by using random sampling. But, given the Sukharev criterion 
[8], we know that an exponential number of samples is needed in any 
case. Lazy PRM provides a link between grid search and classic PRM. DRM is 
definitely an improvement over the PRM by using deterministic sampling. 

The results reported in [12], [13] bound the number of nodes generated by 
probabilistic roadmap planners, under the assumption that the free space $\mathcal{F}$ 
satisfies some geometric properties. One such property, so-called expansiveness, 
measures the difficulty caused by the presence of narrow passages. If $\mathcal{F}$ is 
expansive, the probability that a probabilistic roadmap planner fails to find a free path 
between two given configurations tends exponentially to zero as the number 
of nodes increases. 

Deterministic sampling enables all deterministic roadmap planners to be 
resolution complete (see [6] for more details), in the sense that if it is possible to 
solve the query at a given sampling resolution, they will solve it. The resolution 
can be increased arbitrarily to ensure that any problem can be solved, if a 
solution exists. 

6 Conclusions 

We know that contemporary motion planning algorithms (many of which use 
randomization) are very efficient at solving many difficult problems. This can lead 
to the conclusion that randomization is the key to their effectiveness. 

However, randomization can become a "black box" that hides the reasons 
for the success of an algorithm. Therefore, the attempts to derandomize popular 
motion planning algorithms do not reflect antipathy towards randomization, 
but rather the desire to understand the fundamental insights of these algorithms. 

The Hsu and SBL algorithms are, properly speaking, partially randomized versions, in which 
deterministic and randomized strategies are combined. We hope that more work 
investigating partially randomized and deterministic variants of contemporary 
motion planning algorithms will be carried out in the future. As we have 
already mentioned, our work can be seen as part of the efforts proposed in [6] 
to derandomize PRMs. 

We still need to demonstrate that the performance of deterministic sampling 
in dimensions higher than 10 is equivalent to that obtained in 
lower dimensions. In Section 2.2 we discussed the uniformity of low-discrepancy 
sequences. 

Randomization is a common algorithmic technique, and it is of great value in 
many contexts. In the robot motion planning context, randomization is the most 
effective technique for reducing the high cost associated with moving objects with 
many degrees of freedom. The usefulness of randomization for this purpose has 
been challenged. 